Transactions of the I.R.E.
Professional Group on
INFORMATION THEORY


PGIT-4 SEPTEMBER 1954 


1954 SYMPOSIUM ON INFORMATION THEORY 
held at 
Massachusetts Institute of Technology 
Cambridge, Massachusetts 


September 15-17, 1954




The Institute of Radio Engineers


IRE PROFESSIONAL GROUP ON INFORMATION THEORY 


The Professional Group on Information Theory is an organization, within the
framework of the IRE, of members with principal professional interest in Information
Theory. All members of the IRE are eligible for membership in the
Group and will receive all Group publications upon payment of prescribed
assessments.


Annual Assessment: $2.00 


Administrative Committee 


Chairman: William G. Tuller, Melpar, Inc., Alexandria, Va.
Vice Chairman: Louis A. deRosa, Federal Telecommunications
Laboratories, Inc., Nutley, N. J.
Secretary: Harold R. Holloway, Sylvania Electric Products, Inc.,
Bayside, L. I., N. Y.


R. M. Fano
Research Laboratory of Electronics
Massachusetts Institute of Technology
Cambridge 39, Massachusetts

Nathan Marchand
Marchand Electronics
Greenwich, Connecticut

M. J. E. Golay
Squier Signal Laboratory
Fort Monmouth, New Jersey

Winslow Palmer
Sperry Gyroscope Company
Great Neck, New York

C. H. Page
National Bureau of Standards
Washington 25, D. C.

Wilbur B. Davenport, Jr.
Lincoln Laboratories
Massachusetts Institute of Technology
Cambridge 39, Massachusetts

M. J. DiToro
Fairchild Guided Missile Laboratory
Wyandanch, Long Island, New York

[name illegible]
Federal Telecommunications Labs., Inc.
Nutley, New Jersey

Meyer Leifer
Electronic Defense Laboratory
Sylvania Electric Products, Inc.
Mountain View, P.O. Box 205, Calif.

Ernest R. Kretzmer
Bell Telephone Laboratories
Murray Hill, New Jersey

W. D. White
Airborne Instruments Laboratory, Inc.
160 Old Country Road
Mineola, New York

Bernard M. Oliver
Hewlett-Packard Corporation
Palo Alto, California


TRANSACTIONS OF THE I.R.E.
Professional Group on Information Theory 


Published by the Institute of Radio Engineers, Inc., for the Professional Group 
on Information Theory at 1 East 79th Street, New York 21, N. Y. Responsibility 
for the contents rests upon the authors, and not upon the Institute, the Group 
or its members. Individual copies available for sale to IRE-PGIT members at 
$3.35; to IRE members at $5.00; and to non-members at $10.00. 


Copyright, 1954, The Institute of Radio Engineers, Inc.


All rights, including translation, are reserved by the Institute. Requests for republication
privileges should be addressed to the Institute of Radio Engineers, 1 E. 79th St., New York 21, N. Y.


TRANSACTIONS 
of the 
1954 SYMPOSIUM ON INFORMATION THEORY 
held at 
Massachusetts Institute of Technology, Cambridge, Massachusetts 
September 15-17, 1954 


Organized by 


The Professional Group on Information Theory, Institute of Radio Engineers 


In cooperation with 


The Research Laboratory of Electronics, Massachusetts Institute of Technology 


and sponsored by 


The American Institute of Electrical Engineers 
The International Scientific Radio Union (URSI) 
The Office of Naval Research 
The Signal Corps Engineering Laboratories 
The Air Research and Development Command 


Organizing Committee 
R. M. Fano, Chairman 


T. P. Cheatham     D. A. Huffman     W. F. Potter
W. B. Davenport    W. H. Huggins     W. A. Rosenblith
B. Dudley          Y. W. Lee         R. A. Sayers
P. Elias           A. J. Poté        J. B. Wiesner




CONTENTS AND ABSTRACTS 


Preface - W. G. Tuller 
CODING I, Chairman, J. B. Wiesner


"A New Basic Theorem of Information Theory", by A. Feinstein

A new theorem for noisy channels, similar to Shannon's in its
general statement but giving sharper results, is first formulated
and proven. It is then shown that the equivocation of the channel
defined by the theorem vanishes with increasing code length. The
remaining sections are devoted to various generalizations; in
particular, to defining a continuous channel in a manner that
permits the application to it of the results given above. The
detailed proof of the equivalence of this definition and Shannon's
is given in an Appendix.

"Binary Coding", by M. J. E. Golay

It is shown that efficient binary symbol coding with 2-error
corrections is impossible, and, more generally, that a search for
efficient e-error symbol coding need only be a finite one, as an
upper bound exists beyond which demonstrable impossibility sets in.

The question remains open whether there exists more than one case
of efficient binary message coding with more than one error correction,
and the possibly greater fruitfulness of this approach is suggested by
a few cases of inefficient coding where more information can be transmitted
by message coding than by symbol coding.

"Error-Free Coding", by P. Elias

Some simple constructive procedures are given for coding sequences
of symbols to be transmitted over noisy channels. A message encoded
by such a process transmits a positive amount of information over the
channel, with an error probability which the receiver may set to be as
small as it pleases, without consulting the transmitter. The amount of
information transmitted is less than the channel capacity, so the procedures
are not ideal, but they are quite efficient for small error
probabilities.


CODING II, Chairman, W. G. Tuller 


"A Class of Multiple-Error-Correcting Codes and the Decoding Scheme",
by I. S. Reed

A procedure for constructing one-error-correcting and two-error-detecting
systematic codes has been introduced by R. W. Hamming. Some examples of
n-error-correcting and (n+1)-error-detecting systematic codes for the
cases where both the code length and n+1 are powers of two are presented.
The decoding scheme presented in this report differs from Hamming's
scheme in that the encoded message will be extracted directly from the
possibly corrupted received code by a majority testing of the redundant
relations within the code.




"Coding for Constant-Data-Rate Systems", by R.A. Silverman and M. Balser 


This paper discusses the use of error-correcting codes in reducing the
error rate in communication systems which transmit data at a constant
rate. A new single-error-correcting code (the Wagner code) is described
and analyzed. Its performance is compared with that of Hamming's single-
error-correcting code, and is found to be superior for many communication
applications. The principle of the Wagner code is then used to construct
two new multiple-error-correcting codes, whose performance is compared
with that of Reed's multiple-error-correcting code. If the criterion of
performance is that the frequency of errors in sequences of m binary
digits be as small as possible, it is found that each of the multiple-
error-correcting codes is especially suited to a certain range of values
of m.


INFORMATION AND ORGANIZATION, Chairman, B. McMillan

"Information, Organization and Systems", by J. Rothstein

The object of this paper is to develop and apply a mathematical concept of
organization and of systems. It is very closely related to the information
concept and provides the link whereby the theorems of communication theory
become generalized and applicable to systems in general. Brief applications
are given to system reliability, the significance of organization theory for
circuit design, and production and quality control from a systems viewpoint.


"An Information-Theoretical Model of Organizations", by M. Kochen

The description of certain organizational systems is axiomatically formalized.
Such systems are regarded, relative to an outside observer, as groups of
participants, capable of selecting from a set of alternatives, such as to
maximize the value to themselves, according to a subjective scale; each
member is able to store, subject to limited storage capacity, the communicated
choices from one or more others, in suitably coded form.

The only data assumed to be available to each participant is a matrix of the
encoded choices of others, and the value accruing from the combination, for
a sample of several time periods. In terms of this, expressions for efficiency
and order of organization are obtained which agree with established results
for large samples. Communication patterns, application of Shannon's
fundamental theorem, the temporal behavior of systems in special cases, and
some illustrative examples and applications are studied.

"Simulation of Self-Organizing Systems by Digital Computer", by B. G. Farley and
W. A. Clark

A general discussion of ideas and definitions relating to self-organizing
systems and their synthesis is given, together with remarks concerning their
simulation by digital computer. Synthesis and simulation of an actual system
is then described. This system, initially randomly organized within wide
limits, organizes itself to perform a simple prescribed task.




ANALYSIS AND RETRIEVAL OF INFORMATION, Chairman, L. A. deRosa

"A Study of Ergodicity and Redundancy based on Intersymbol Correlation
of Finite Range", by S. Watanabe

Some of the basic concepts of information theory are critically reviewed
in the light of a generalized formulation of the theory of Markoff's
chains, in which the initial and final states are sequences of symbols
of different lengths, and occurrence of symbols is governed by intersymbol
correlation probability of finite range. In particular, the
conditions of ergodicity and the structure of "ergodic subsets" of
sequences of arbitrary length are carefully discussed. A mathematical
method is developed to determine the "range" and "strength" of intersymbol
correlation. A brief summary of the content is given at the end
of Section 1.

"Multivariate Information Transmission", by W. J. McGill

A multivariate analysis based on transmitted information is presented.
It is shown that sample transmitted information provides a simple method
for measuring and testing association in multidimensional contingency
tables. Relations with analysis of variance are pointed out, and
statistical tests are described.

"Choice and Coding in Information Retrieval Systems", by C. N. Mooers

Information retrieval machines are devices for indexing and selecting
information in a library. The operation of these machines is based upon
some sort of an arbitrary code system, in terms of which the machines'
operations are defined. This paper applies the formalism of communication
theory as developed for signalling to machines for information retrieval.
Three topics are discussed in this paper. 1) The retrieval system analogue
to $H = -\sum_i p_i \log p_i$, the measure of the output of a source, is developed.
2) The retrieval system analogues to synchronous and asynchronous multiplex
types of coding are described and channel capacities are discussed. For the
latter type of coding, called "superimposed", a new limit on "channel
capacity" of log 2 bits per site is given. 3) Selection errors due to coding
are discussed, and it is shown that the frequency of errors can be made
arbitrarily small for the superimposed type of coding in analogue to the
result for signalling.

DETECTION AND PREDICTION I, Chairman, E. Weber

"Modern Statistical Approaches to Reception in Communication Theory", by
D. Van Meter and D. Middleton

When reception in the theory of communication is recognized as a problem in
statistical inference, system design and system analysis appear as the
counterparts of designing and evaluating statistical tests. This paper
discusses the optimum properties of designs based on statistical decision
theory from the risk point of view, and from that of information theory.
Connections between risk and information loss are established, which
result in a unified theory of system design. This includes Minimax methods
capable in principle of handling all degrees of a priori knowledge of signal
and noise statistics, new methods for comparing actual and ideal systems for
the same purpose, and new interpretations of previously used formulations as
special cases of the more general theory. Both detection and extraction of
signals in noise are considered, the former as a problem of testing
statistical hypotheses and the latter as one of estimating parameters.

"A Non-Linear Prediction Theory", by R. F. Drenick

The paper deals with the smoothing and the prediction of certain
signals in noise. It is, more particularly, a study of the conditions
under which the optimum sampling filters obtained from
that theory are non-linear, and what improvement in performance
can be expected. The theory is restricted to the case of signals
which are representable over a reasonable period by a polynomial
in time. It is found that non-linear prediction filters result,
in general, when the noise is non-Gaussian, regardless of error
criterion. Rules are established for the synthesis of such
filters. The performance is calculated for a [illegible] case and
indicates considerable improvement.


"The Detection of Signals Perturbed by Scatter and Noise", by R. Price

The functional form of a probability computing receiver is
determined for the case of signals perturbed both by transmission
through a "scatter" channel and by the addition of Gaussian noise.
The "scatter" channel considered here takes the form of a complex
multiplicative random process. In general, the receiver computations
involve the operations of linear transformation and matrix inversion.
However, in the case of small signal-to-noise ratios a considerable
simplification results and a practical receiver structure is obtained.


DETECTION AND PREDICTION II, Chairman, L. G. Abraham

"The Theory of Signal Detectability", by W. W. Peterson, T. G. Birdsall,
and W. C. Fox

The problem of signal detectability treated in this paper is the
following: Suppose an observer is given a voltage varying with time
during a prescribed observation interval and is asked to decide whether
its source is noise or is signal plus noise. What method should the
observer use to make this decision, and what receiver is a realization
of that method? After giving a discussion of theoretical aspects of
this problem, the paper presents specific derivations of the optimum
receiver for a number of cases of practical interest.


"The Human Use of Information. I: Signal Detection for the Case of the Signal
Known Exactly", by Wilson P. Tanner, Jr., and John A. Swets

A theory of visual detection is developed, based on the model provided by
the theory of signal detectability, and, more generally, by the theory of
statistical decision. Two experiments are reported which test some predictions
of the theory for the case of the signal-known-exactly. These
experiments demonstrate that the human observer tends toward optimum
behavior, where optimum behavior is defined as that behavior which
maximizes the expected gain from the decision. Their results show the
proportion of correct detections to be dependent upon the proportion of
false alarms; they indicate that neural activity is a power function of
signal intensity. The data also demand a reevaluation of the threshold
concept. Predictions are made for the data obtained using two different
methods of response, forced-choice and yes-no, and the internal consistency
of the theory is demonstrated. The predictions of the theory are compared
with contrasting predictions of conventional sensory theory; the data are
[remainder illegible]


"The Human Use of Information. II: Signal Detection for the Case of an
Unknown Signal Parameter", by W. P. Tanner, Jr., and R. Z. Norman

Two specific cases of signal detection involving uncertainty in the
frequency of a sound signal are compared with the case of the signal-
known-exactly. In the first case the signal is either of two known
frequencies; in the second case the signal is any frequency within a
given range. It is suggested that detection behavior that is optimal
for the three cases requires a dual mechanism: a combination of a
wide-open receiver and a panoramic receiver. Evidence is presented
that supports the existence of such a mechanism. Estimates of the
bandwidth and scan-rate of the receiver are included.




PREFACE 


The 1954 Symposium on Information Theory is the regular yearly
symposium organized by the Professional Group on Information Theory 
and held alternately on the East and West Coasts. 


In view of the widespread interest in information theory, the 
PGIT has invited other organizations to join in the planning and 
sponsoring of the symposium so as to make it the one outstanding 
occasion for presentation and discussion of the most recent, signi- 
ficant advances made in the field. 


We are most happy to list among the sponsors of this symposium
our sister organization, the American Institute of Electrical Engineers,
and the International Scientific Radio Union (URSI). We are particularly
thankful to the Research Laboratory of Electronics of the Massachusetts
Institute of Technology for serving as host to the Symposium,
and to the Office of Naval Research, the Air Research and Development
Command, and the Signal Corps Engineering Laboratories for the financial
support provided by them through their joint contract with the
Research Laboratory of Electronics.


The TRANSACTIONS are published prior to the Symposium to allow
the participants to familiarize themselves with the papers. It is 
hoped that the spirited discussion that will undoubtedly arise from 
a better-informed audience will generate new ideas and stimulate 
future research. 


Meeting the schedule required to publish these TRANSACTIONS has 
subjected many people to pressure and inconveniences. We are very 
grateful to the individual authors for meeting unusually stringent
deadlines; to the organizing committee for the careful planning and 
review of papers; and to the staff of IRE Headquarters for the prompt 
publication and distribution of these TRANSACTIONS. 


W. G. Tuller, Chairman 
Professional Group on Information Theory 
Institute of Radio Engineers 


A NEW BASIC THEOREM OF INFORMATION THEORY*


Amiel Feinstein
Research Laboratory of Electronics, Massachusetts Institute of Technology 
Cambridge, Massachusetts 


INTRODUCTION 

Information theory, in the restricted sense used in this paper, originated 
in the classical paper of C. E. Shannon, in which he gave a precise mathematical
definition for the intuitive notion of information. In terms of this definition
it was possible to define precisely the notion of a communication channel
and its capacity. Like all definitions that purport to deal with intuitive concepts, 
the reasonability and usefulness of these definitions depend for the most part 
on theorems whose hypotheses are given in terms of the new definitions but 
whose conclusions are in terms of previously defined concepts. The theorems 
in question are called the fundamental theorems for noiseless and noisy 
channels. We shall deal exclusively with noisy channels. 

By a communication channel we mean, in simplest terms, an apparatus for 
signaling from one point to another. The abstracted properties of a channel 
that will concern us are: (a) a finite set of signals that may be transmitted; 
(b) a set (not necessarily finite) of signals that may be received; (c) the prob- 
ability (or probability density) of the reception of any particular signal when 
the signal transmitted is specified. A simple telegraph system is a concrete 
example. The transmitted signals are a long pulse, a short pulse, and a pause. 
If there is no noise in the wire, the possible received signals are identical with
the transmitted signals. If there is noise in the wire, the received signals will
be mutilations of the transmitted signals, and the conditional probability will
depend on the statistical characteristics of the noise present.

We shall now sketch the definitions and theorems mentioned. Let X
be a finite abstract set of elements x, and let p( ) be a probability distribution
on X. We define the "information content" of X by the expression

$$H(X) = -\sum_{x} p(x) \log_2 p(x),$$

where the base 2 simply determines the fundamental unit of
information, called the "bit". One intuitive way of looking at this definition is to
consider a machine that picks, in a purely random way but with the given probabilities,
one x per second from X. Then $-\log_2 p(x_0)$ may be considered as
the information or surprise associated with the event that $x_0$ actually came up.
If each event x consists of several events, that is, if $x = \{a, b, \ldots\}$, we have
the following meaningful result: $H(X) \le H(A) + H(B) + \ldots$ with equality if, and
only if, the events a, b, ... are mutually independent.


*This work was supported in part by the Signal Corps; the Office of Scientific Research,
Air Research and Development Command; and the Office of Naval Research.

We are now in a position to discuss the first fundamental theorem. We
set ourselves the following situation. We have the set X. Suppose, further,
that we have some "alphabet" of D "letters" which we may take as 0, ..., D-1.
We wish to associate to each x a sequence of integers 0, ..., D-1 in such a
way that no sequence shall be an extension of some shorter sequence (for
otherwise they would be distinguishable by virtue of their length, which
amounts to introducing a (D+1)th "variable"). Now it is easy to show that a set
of D elements has a maximum information content when each element has the
same probability, namely 1/D. Suppose now that with each x we associate a
sequence of length $N_x$. The maximum amount of information obtainable by
"specifying" that sequence is $N_x \log_2 D$ bits. Suppose $N_x \log_2 D = -\log_2 p(x)$;
then $\sum_x p(x) N_x = H(X)/\log_2 D$ is the average length of the sequence. The first
fundamental theorem now states that if we content ourselves with representing
sequences of x's by sequences of integers 0, ..., D-1, then if we choose our
x-sequences sufficiently long, the sequences of integers representing them
will have an average length as little greater than $H(X)/\log_2 D$ as desired, but
that it is not possible to do any better than this.
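
The averaging argument above can be tried numerically. The following Python sketch is an illustration only and is not part of the original paper: it rounds the ideal lengths up to the smallest integers N_x with D^(-N_x) <= p(x), checks Kraft's inequality (a standard condition, not stated in the text, under which sequences of these lengths can be chosen so that none extends another), and compares the average length with H(X)/log_2 D. The function names and the example distribution are assumptions.

```python
import math

def entropy_bits(probs):
    """H(X) = -sum p(x) log2 p(x), in bits."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def code_lengths(probs, D=2):
    """Smallest integer N_x with D**(-N_x) <= p(x); every p(x) must be positive."""
    lengths = []
    for p in probs:
        if p <= 0:
            raise ValueError("probabilities must be positive")
        n = 0
        while D ** (-n) > p:
            n += 1
        lengths.append(n)
    return lengths

def average_length(probs, lengths):
    """Average sequence length, sum of p(x) * N_x."""
    return sum(p * n for p, n in zip(probs, lengths))

def kraft_sum(lengths, D=2):
    """Sum of D**(-N_x); a value not exceeding 1 allows a prefix-free choice."""
    return sum(D ** (-n) for n in lengths)

probs = [0.5, 0.25, 0.125, 0.125]   # hypothetical source with dyadic probabilities
lengths = code_lengths(probs)        # binary alphabet, D = 2
print(lengths, average_length(probs, lengths), entropy_bits(probs))
```

For probabilities that are exact powers of 1/D the average length meets H(X)/log_2 D exactly; otherwise the rounding costs less than one letter per symbol, which is why the theorem speaks of coding long x-sequences.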

To discuss the second fundamental theorem, we now take, as usual, X to
be the set of transmitted messages and Y the set of received signals. For
simplicity we take Y finite. The conditional probability mentioned above we
denote by p(y/x). Let p( ) be a probability distribution over X, whose meaning
is the probability of each x being transmitted. Then the average amount of
information being fed into the channel is $H(X) = -\sum_x p(x) \log_2 p(x)$. Since in general
the reception of a y does not uniquely specify the x transmitted, we
inquire how much information was lost in transmission. To determine this,
we note that, inasmuch as the x was completely specified at the time of transmission,
the amount of information lost is simply the amount of information
necessary (on the average, of course) to specify the x. Having received y,
our knowledge of the respective probability of each x having been the one
transmitted is given by p(x/y). The average information needed to specify x
is now $-\sum_x p(x/y) \log_2 p(x/y)$. We must now average this expression over the
set of possible y's. We obtain finally

$$-\sum_y p(y) \Big( \sum_x p(x/y) \log_2 p(x/y) \Big) = -\sum_y \sum_x p(x, y) \log_2 p(x/y) = H(X/Y),$$

often called the equivocation of the channel. The rate at which information is
received through the channel is therefore $R = H(X) - H(X/Y)$. A precise statement
of the fundamental theorem for noisy channels is given in section II.
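
As a small worked case of the quantities just defined, the sketch below (an illustration, not taken from the paper) computes H(X), the equivocation H(X/Y), and the rate R = H(X) - H(X/Y) for a channel given by p(x) and p(y/x). The binary symmetric channel with crossover probability 0.1 and equiprobable inputs is an assumed example, and the function names are hypothetical.

```python
import math

def channel_rates(p_x, p_y_given_x):
    """Return (H(X), H(X/Y), R) in bits for a finite channel.

    p_x[i] is p(x_i); p_y_given_x[i][j] is p(y_j / x_i).
    """
    n_x, n_y = len(p_x), len(p_y_given_x[0])
    joint = [[p_x[i] * p_y_given_x[i][j] for j in range(n_y)] for i in range(n_x)]
    p_y = [sum(joint[i][j] for i in range(n_x)) for j in range(n_y)]
    h_x = -sum(p * math.log2(p) for p in p_x if p > 0)
    # Equivocation: -sum over x, y of p(x, y) log2 p(x/y), with p(x/y) = p(x, y) / p(y).
    h_x_given_y = -sum(
        joint[i][j] * math.log2(joint[i][j] / p_y[j])
        for i in range(n_x) for j in range(n_y) if joint[i][j] > 0
    )
    return h_x, h_x_given_y, h_x - h_x_given_y

def binary_symmetric(q):
    """Transition probabilities of a binary channel that flips a symbol with probability q."""
    return [[1 - q, q], [q, 1 - q]]

h_x, h_x_given_y, rate = channel_rates([0.5, 0.5], binary_symmetric(0.1))
print(h_x, h_x_given_y, rate)
```

With equiprobable inputs the equivocation of this channel equals the entropy of the crossover probability, so the rate falls below one bit per symbol by exactly that amount.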


I. For the sake of definiteness we begin by stating a few definitions and subsequent
lemmas, more or less familiar.

Let X and Y be abstract sets consisting of a finite number, α and β, of
points x and y. Let p( ) be a probability distribution over X, and for each
x ∈ X let p( /x) denote a probability distribution over Y. The totality of objects
thus far defined will be called a communication channel.

The situation envisaged is that X represents a set of symbols to be transmitted
and Y represents the set of possible received signals. Then p(x) is the
a priori probability of the transmission of a given symbol x, and p(R/x) is the
probability of the received signal lying in a subset R of Y, given that x has
been transmitted. Clearly, $\sum_{x \in Q} p(x)\, p(R/x)$ represents the joint probability of
R and a subset Q of X, and will be written as p(Q, R). Further, p(X, R) = p(R)
represents the absolute probability of the received signal lying in R. (The use
of p for various different probabilities should not cause any confusion.)
The "information rate" of the channel "source" X is defined by
$H(X) = -\sum_X p(x) \log p(x)$, where here and in the future the base of the logarithm is 2.


The "reception rate" of the channel is defined by the expression

$$\sum_X \sum_Y p(x, y) \log \frac{p(x, y)}{p(x)\, p(y)}.$$


If we define the "equivocation" $H(X/Y) = -\sum_X \sum_Y p(x, y) \log p(x/y)$, then the reception
rate is given by H(X) - H(X/Y). The equivocation can be interpreted as
the average amount of information, per symbol, lost in transmission. Indeed
we see that H(X/Y) = 0 if and only if p(x/y) is either 0 or 1, for any x, y, that
is, if the reception of a y uniquely specifies the transmitted symbol. When
H(X/Y) = 0 the channel is called noiseless. If we interpret H(X) as the average
amount of information, per symbol, required to specify a given symbol of
the ensemble X, with p( ) as the only initial knowledge about X, then
H(X) - H(X/Y) can be considered as the average amount, per symbol transmitted,
of the information obtained by the (in general) only partial specification
of the transmitted symbol by the received signal.
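
The agreement between the double-sum definition of the reception rate and the difference H(X) - H(X/Y) can be checked directly from a joint distribution. The sketch below is an illustration under assumed data (the joint table is hypothetical, as are the function names); it also confirms that a channel in which each y determines x has zero equivocation.

```python
import math

def marginals(joint):
    """Return (p(x), p(y)) from a table joint[i][j] = p(x_i, y_j)."""
    p_x = [sum(row) for row in joint]
    p_y = [sum(col) for col in zip(*joint)]
    return p_x, p_y

def reception_rate(joint):
    """Double sum of p(x, y) log2 [ p(x, y) / (p(x) p(y)) ]."""
    p_x, p_y = marginals(joint)
    return sum(
        pxy * math.log2(pxy / (p_x[i] * p_y[j]))
        for i, row in enumerate(joint)
        for j, pxy in enumerate(row) if pxy > 0
    )

def source_rate(joint):
    """H(X) computed from the x-marginal."""
    p_x, _ = marginals(joint)
    return -sum(p * math.log2(p) for p in p_x if p > 0)

def equivocation(joint):
    """H(X/Y) = -sum p(x, y) log2 p(x/y)."""
    _, p_y = marginals(joint)
    return -sum(
        pxy * math.log2(pxy / p_y[j])
        for row in joint
        for j, pxy in enumerate(row) if pxy > 0
    )

joint = [[0.30, 0.10], [0.05, 0.25], [0.10, 0.20]]   # hypothetical p(x, y)
print(reception_rate(joint), source_rate(joint) - equivocation(joint))
```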

Let now u (v) represent a sequence of length n (where n is arbitrary but
fixed) of statistically independent symbols x (y), and let the space of all
sequences be denoted by U (V). In the usual manner we can define the various
"product" probabilities. The n will be suppressed throughout. It is now
simple to verify the following relations:

$$\log p(u) = \sum_{i=1}^{n} \log p(x_i), \quad \text{where } u = \{x_1, \ldots, x_n\} \qquad (1)$$

$$\log p(u/v) = \sum_{i=1}^{n} \log p(x_i/y_i), \quad \text{where } v = \{y_1, \ldots, y_n\} \qquad (2)$$

$$nH(X) = -\sum_{U} p(u) \log p(u) \qquad (3)$$

$$nH(X/Y) = -\sum_{U} \sum_{V} p(u, v) \log p(u/v) \qquad (4)$$
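
Relation (3) can be verified by brute force for a short sequence length. The sketch below is an illustration only: it enumerates every sequence u of n independent symbols, builds log p(u) as the sum of the symbol logarithms as in relation (1), and compares the resulting sum with n times the single-symbol value H(X). The three-symbol distribution and the function names are assumptions.

```python
import itertools
import math

def symbol_entropy(p_x):
    """H(X) in bits for a single symbol."""
    return -sum(p * math.log2(p) for p in p_x if p > 0)

def sequence_entropy(p_x, n):
    """-sum over all u of p(u) log2 p(u); every p(x) must be positive."""
    total = 0.0
    for u in itertools.product(range(len(p_x)), repeat=n):
        log2_p_u = sum(math.log2(p_x[x]) for x in u)   # relation (1)
        total -= (2.0 ** log2_p_u) * log2_p_u
    return total

p_x = [0.5, 0.3, 0.2]   # hypothetical symbol probabilities
for n in (1, 2, 3, 4):
    print(n, sequence_entropy(p_x, n), n * symbol_entropy(p_x))
```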


The weak law of large numbers at once gives us the following lemma, which
is fundamental for the proof of Shannon's theorem (see also section V).
LEMMA 1. For any $\epsilon, \delta$ there is an $n(\epsilon, \delta)$ such that for any $n > n(\epsilon, \delta)$ the
set of u for which the inequality $|H(X) + (1/n) \log p(u)| < \epsilon$ does not hold has
p( ) probability less than $\delta$. Similarly, but with a different $n(\epsilon, \delta)$, the set
of pairs (u, v) for which the inequality $|H(X/Y) + (1/n) \log p(u/v)| < \epsilon$ does not
hold has p( , ) probability less than $\delta$.

In what follows we shall need only the weaker inequalities $p(u) < 2^{-n(H(X)-\epsilon)}$
and $p(u/v) > 2^{-n(H(X/Y)+\epsilon)}$. The probability of these inequalities failing will
be denoted by $\delta$ and $\delta^*$, respectively.
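
The first statement of lemma 1 can be observed numerically for a binary source. The sketch below is an illustration, not the author's method: for symbols that equal 1 with an assumed probability q, all sequences with the same number of ones have the same p(u), so the probability of the exceptional set is a finite sum over that count. The parameter values are hypothetical, and the counting is done with logarithms only to avoid overflow.

```python
import math

def atypical_probability(q, n, eps):
    """Probability of the binary sequences u of length n for which
    |H(X) + (1/n) log2 p(u)| < eps fails, each symbol being 1 with probability q."""
    h = -(q * math.log2(q) + (1 - q) * math.log2(1 - q))
    total = 0.0
    for k in range(n + 1):                      # k = number of ones in u
        log2_p_u = k * math.log2(q) + (n - k) * math.log2(1 - q)
        if abs(h + log2_p_u / n) >= eps:
            # natural log of the number of sequences with k ones
            ln_count = math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
            total += math.exp(ln_count + log2_p_u * math.log(2))
    return total

for n in (10, 100, 1000):
    print(n, atypical_probability(0.2, n, 0.11))
```

The printed probabilities shrink as n grows, which is the content of the lemma: for any chosen δ the exceptional set eventually has probability below δ.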


The following lemma is required to patch up certain difficulties caused by
the inequalities of lemma 1 failing to hold everywhere.
LEMMA 2. Let Z be a (u, v) set of p( , ) probability greater than $1 - \delta_1$ and
$U_0$ a set of u with $p(U_0) > 1 - \delta_2$. For each $u \in U$ let $A_u$ be the set of v's such
that $(u, v) \in Z$. Let $U_1 \subset U_0$ be the set of $u \in U_0$ for which $p(A_u/u) \ge 1 - \alpha$.
Then $p(U_1) > 1 - \delta_2 - (\delta_1/\alpha)$.
PROOF. Let $U_2$ be the set of u for which $p(A_u^c/u) > \alpha$, where $A_u^c$ is the complement
of $A_u$. Then $p(u, A_u^c) > \alpha\, p(u)$ for $u \in U_2$, and $(u, A_u^c)$ is, by the
definition of $A_u$, outside Z. Hence

$$\delta_1 > \sum_{U_2} p(u, A_u^c) > \alpha\, p(U_2), \quad \text{or} \quad p(U_2) < \frac{\delta_1}{\alpha}.$$

Thus $p(U_0 \cdot U_2) < \delta_1/\alpha$ and, using $U_1 = U_0 - U_0 \cdot U_2$, we have

$$p(U_1) = p(U_0) - p(U_0 \cdot U_2) > 1 - \delta_2 - \frac{\delta_1}{\alpha}.$$
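
Lemma 2 can be exercised on a small randomly generated channel. The sketch below is an illustration under assumptions: the distributions are random, and the particular choices of Z and U_0 are hypothetical. It checks the limiting form of the lemma, p(U_1) >= p(U_0) - (1 - p(Z))/α, which follows from the same argument as the proof and from which the stated bound results for any δ_1 > 1 - p(Z) and δ_2 > 1 - p(U_0).

```python
import random

def random_distribution(k, rng):
    """A random probability vector of length k with strictly positive entries."""
    w = [rng.random() + 0.01 for _ in range(k)]
    s = sum(w)
    return [x / s for x in w]

def lemma2_quantities(p_u, p_v_given_u, Z, U0, alpha):
    """Z is a set of (u, v) index pairs and U0 a list of u indices.
    Returns p(U1) and the bound p(U0) - (1 - p(Z)) / alpha."""
    n_v = len(p_v_given_u[0])
    p_Z = sum(p_u[u] * p_v_given_u[u][v] for (u, v) in Z)

    def p_A_given_u(u):
        # A_u is the set of v with (u, v) in Z
        return sum(p_v_given_u[u][v] for v in range(n_v) if (u, v) in Z)

    U1 = [u for u in U0 if p_A_given_u(u) >= 1 - alpha]
    p_U0 = sum(p_u[u] for u in U0)
    p_U1 = sum(p_u[u] for u in U1)
    return p_U1, p_U0 - (1.0 - p_Z) / alpha

def example(seed=0, n_u=30, n_v=30, alpha=0.3):
    rng = random.Random(seed)
    p_u = random_distribution(n_u, rng)
    p_v_given_u = [random_distribution(n_v, rng) for _ in range(n_u)]
    # Hypothetical Z: pairs whose conditional probability is not too small.
    Z = {(u, v) for u in range(n_u) for v in range(n_v) if p_v_given_u[u][v] > 0.01}
    # Hypothetical U0: every u except the least probable.
    U0 = [u for u in range(n_u) if p_u[u] > min(p_u)]
    return lemma2_quantities(p_u, p_v_given_u, Z, U0, alpha)

print(example())
```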


II. We have seen that, by our definitions, the average amount of information
received, per symbol transmitted, is H(X) - H(X/Y). However, in the process
of transmission an amount H(X/Y) is lost, on the average. An obvious question
is whether it is, in some way, possible to use the channel in such a manner that
the average amount of information received, per symbol transmitted, is as
near to H(X) - H(X/Y) as we please, while the information lost per symbol is,
on the average, as small as we please. Shannon's theorem asserts (1), essentially,
that this is possible. More precisely, let there be given a channel with
rate H(X) - H(X/Y). Then for any e > 0 and H < H(X) - H(X/Y) there is an
n(e, H) such that for each n > n(e, H) there is a family $\{u_i\}$ of message sequences
(of length n) of number at least $2^{nH}$ and a probability distribution on the $\{u_i\}$
such that, if only the sequences $\{u_i\}$ are transmitted, and with the given probabilities,
then they can be detected with average probability of error less than
e. The method of detection is that of maximum conditional probability, hence
the need for specifying the transmission probability of the $\{u_i\}$. By average
probability of error less than e is meant that if $e_i$ is the fraction of the time
that when $u_i$ is sent it is misinterpreted, and $P_i$ is $u_i$'s transmission probability,
then $\sum_i e_i P_i < e$.

A sufficient condition (2) for the above-mentioned possibility is the
following:

For any e > 0 and H < H(X) - H(X/Y) there is an n(e, H) of such value that
among all sequences u of length n > n(e, H) there is a set $\{u_i\}$ of number at
least $2^{nH}$ such that:

1. to each $u_i$ there is a v-set $B_i$ with $p(B_i/u_i) > 1 - e$
2. the $B_i$ are disjoint.
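
The two conditions above can be made concrete with a deliberately simple, low-rate example that is not drawn from the paper: two message sequences, all zeros and all ones, of odd length n, sent over a binary symmetric channel, with the received sequences divided between two sets by majority vote. The sketch below (function names and parameter values are assumptions) checks that the sets are disjoint and computes p(B_i/u_i).

```python
import itertools

def p_v_given_u(u, v, q):
    """Memoryless binary symmetric channel with crossover probability q."""
    d = sum(a != b for a, b in zip(u, v))
    return q ** d * (1 - q) ** (len(u) - d)

def majority_sets(n):
    """Split all received sequences of odd length n by majority vote."""
    B0, B1 = set(), set()
    for v in itertools.product((0, 1), repeat=n):
        (B1 if 2 * sum(v) > n else B0).add(v)
    return B0, B1

def p_set_given_u(B, u, q):
    """p(B / u): probability that the received sequence lies in B when u is sent."""
    return sum(p_v_given_u(u, v, q) for v in B)

n, q = 5, 0.1                      # hypothetical length and crossover probability
u0, u1 = (0,) * n, (1,) * n
B0, B1 = majority_sets(n)
print(B0.isdisjoint(B1), p_set_given_u(B0, u0, q), p_set_given_u(B1, u1, q))
```

Here only two sequences of length 5 are used, so the rate is 1/5 bit per symbol; the theorem concerns how many such sequences, with disjoint sets of this kind, can be found as n grows.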


What this says is simply that if we agree to send only the set $\{u_i\}$ and always
assume that, when the received sequence lies in $B_i$, $u_i$ was transmitted, then
we shall misidentify the transmitted sequence less than a fraction e of the
time. As it stands, however, the above is not quite complete; for, if C is the
largest number such that for H < C there is an n(e, H) and a set of at least $2^{nH}$
sequences $u_i$ satisfying 1 and 2, C is well defined in terms of p(X/Y) alone.
However, H(X) - H(X/Y) involves p(X) in addition to p(X/Y). One might guess
that C is equal to l.u.b. (H(X) - H(X/Y)) over all choices of p( ). This is
indeed so, as the theorem below shows. Note the important fact that we have
here a way of defining the channel capacity C without once mentioning information
contents or rates. (Strictly speaking we should now consider the channel
as being defined simply by p(y/x).) These remarks evidently apply equally
well to Shannon's theorem, as we have stated it. We go now to the main
theorem.


THEOREM. For any e > 0 and H < C there is an n(e, H) such that among all
sequences u of length $n \ge n(e, H)$ there is a set $\{u_i\}$ of number at least $2^{nH}$
such that:
1. to each $u_i$ there is a v-set $B_i$ with $p(B_i/u_i) > 1 - e$
2. the $B_i$ are disjoint.
This is not possible for any H > C.
PROOF. Let us note here that if we transmit the $u_i$ with equal probability and
use a result of section III (namely $P_e < e$) we immediately obtain the positive
assertion of Shannon's theorem. We shall first indicate only the proof that the
theorem cannot hold for H > C, which is well known. Indeed if one could take
H > C then, as shown in section III, one would have, for n sufficiently large,
the result that the information rate per symbol would exceed C. But this
cannot be (3). Q.E.D. In the following we will take p( ) as that for which the
value C is actually attained (4). We shall see, however, that no use of this
fact is actually made in what follows, other than, of course, C = H(X) - H(X/Y).
For given $\epsilon_1, \delta_1, \epsilon_2, \delta_2$, let $n_1(\epsilon_1, \delta_1)$, $n_2(\epsilon_2, \delta_2)$ be as in lemma 1 for
$p(u/v) > 2^{-n(H(X/Y)+\epsilon_1)}$ and $p(u) < 2^{-n(H(X)-\epsilon_2)}$, respectively. Let us
henceforth consider n as fixed and $n \ge \max(n_1(\epsilon_1, \delta_1), n_2(\epsilon_2, \delta_2))$. For Z and $U_0$
in lemma 2 we take, respectively, the sets on which the first two inequalities
stated above hold. Then for any $u \in U_1$ (with $\alpha$ as any fixed number < e) and
v in the corresponding $A_u$ we have:

$$\frac{p(u/v)}{p(u)} > \frac{2^{-n(H(X/Y)+\epsilon_1)}}{2^{-n(H(X)-\epsilon_2)}} = 2^{n(C-\epsilon_1-\epsilon_2)}$$

$$\frac{p(u, v)}{p(u)} > 2^{n(C-\epsilon_1-\epsilon_2)}\, p(v)$$

Summing v over $A_u$ we have

$$\frac{p(u, A_u)}{p(u)} > 2^{n(C-\epsilon_1-\epsilon_2)}\, p(A_u)$$

Since $1 \ge p(A_u/u)$ we have finally

$$p(A_u) < 2^{-n(C-\epsilon_1-\epsilon_2)}$$


Let $u_1, \ldots, u_N$ be a set M of members of $U_1$ such that:

a. to each $u_i$ there is a v-set $B_i$ with $p(B_i/u_i) > 1 - e$

b. $p(B_i) < 2^{-n(C-\epsilon_1-\epsilon_2)}$ (See footnote 5.)

c. the $B_i$ are disjoint

d. the set M is maximal, that is, we cannot find a $u_{N+1}$ and a $B_{N+1}$
such that the set $u_1, \ldots, u_{N+1}$ satisfies (a) to (c).

Now for any $u \in U_1$ there is by definition an $A_u$ such that $p(A_u/u) \ge 1 - \alpha > 1 - e$
and, as we have seen above, $p(A_u) < 2^{-n(C-\epsilon_1-\epsilon_2)}$. Furthermore,
for any $u \in U_1$, $A_u - A_u \cap \sum_i B_i$ is disjoint from the $B_i$, and certainly

$$p\Big(A_u - A_u \cap \sum_i B_i\Big) < 2^{-n(C-\epsilon_1-\epsilon_2)}$$




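If u is not in M, we must therefore have

$$p\Big(A_u - A_u \cap \sum_i B_i \,\Big/\, u\Big) \le 1 - e$$

In other words, $p(A_u \cap \sum_i B_i / u) \ge e - \alpha$, or certainly

$$p\Big(\sum_i B_i \,\Big/\, u\Big) \ge e - \alpha, \quad \text{for all } u \in U_1 - M = U_1 - M \cap U_1$$

Now

$$p\Big(\sum_i B_i\Big) = \sum_U p\Big(\sum_i B_i / u\Big) p(u) \ge \sum_{U_1 - M} p\Big(\sum_i B_i / u\Big) p(u) + \sum_{M \cap U_1} p\Big(\sum_i B_i / u\Big) p(u)$$

$$\ge (e - \alpha)\Big[1 - \delta_2 - \frac{\delta_1}{\alpha} - p(M \cap U_1)\Big] + (1 - e)\, p(M \cap U_1) \ge (e - \alpha)\Big[1 - \delta_2 - \frac{\delta_1}{\alpha}\Big]$$

if $e \le 1/2$, since then $1 - e \ge e - \alpha$.

On the other hand, $p(\sum_i B_i) < N\, 2^{-n(C-\epsilon_1-\epsilon_2)}$. Hence

$$N\, 2^{-n(C-\epsilon_1-\epsilon_2)} > (e - \alpha)\Big[1 - \delta_2 - \frac{\delta_1}{\alpha}\Big]$$

If e > 1/2 then, using $p(M \cap U_1) < N\, 2^{-n(H(X)-\epsilon_2)}$, we would obtain

$$N\, 2^{-n(C-\epsilon_1-\epsilon_2)} > (e - \alpha)\Big[1 - \delta_2 - \frac{\delta_1}{\alpha} - N\, 2^{-n(H(X)-\epsilon_2)}\Big]$$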


Since the treatment of both cases is identical, we will consider $e \le 1/2$.
To complete the proof we must show that for any e and H < C it is possible
to choose $\epsilon_1, \epsilon_2, \delta_1, \delta_2$, $\alpha < e$, and $n \ge \max(n_1(\epsilon_1, \delta_1), n_2(\epsilon_2, \delta_2))$ in such a
way that the above inequality requires $N > 2^{nH}$. Now it is clear that if, having
chosen certain fixed values for the six quantities mentioned, the inequality
fails upon the insertion of a given value (say $N^*$) for N, then the smallest N
for which the inequality holds must be greater than $N^*$. Let us point out that
N will in general depend upon the particular maximal set considered.

We take $N^* = 2^{nH}$ and $\alpha = e/2$. Then we can take $\delta_1$, $\delta_2$, and $\epsilon_2$ so small
and n so large that the bracketed factor on the right-hand side of the above
inequality exceeds 2/3, say.
We obtain finally $e/3 < 2^{-n(C-H-\epsilon_1-\epsilon_2)}$. Choosing $\epsilon_2$ and $\epsilon_1$ sufficiently small
so that $C - H - \epsilon_1 - \epsilon_2 > 0$ we see that for sufficiently large n the inequality
$e/3 < 2^{-n(C-H-\epsilon_1-\epsilon_2)}$ fails. Hence for $\alpha = e/2$, for $\epsilon_1, \epsilon_2, \delta_1, \delta_2$ sufficiently
small, the insertion of $N^* = 2^{nH}$ for N causes the inequality to fail for all n
sufficiently large. Thus $N > N^* = 2^{nH}$ for such n. Q.E.D.




It is worthwhile to emphasize that the codes envisaged here, unlike those 
of Shannon, are uniformly good, i. e., the probability of error for the elements 
of a maximal set is uniformly <e. These codes are therefore error correcting, 
which answers in the affirmative the question as to whether the channel 
capacity can be approached using such codes (6). 

If we wish to determine how e decreases as a function of n, for fixed H,
we have (7):

$$e \le \alpha + \frac{A}{B - (\delta_1/\alpha)}, \quad \text{where } A = 2^{-n(C-H-\epsilon_1-\epsilon_2)},\ B = 1 - \delta_2$$

To eliminate the "floating" variable $\alpha$, we proceed as follows. For $\alpha > 0$,

$$\alpha + \frac{A}{B - (\delta_1/\alpha)} \text{ achieves its minimum value at } \alpha = \frac{(A\delta_1)^{1/2} + \delta_1}{B}$$

and this value, namely $\frac{1}{B}\big[A^{1/2} + \delta_1^{1/2}\big]^2$, is greater than $\frac{1}{B}\big[(A\delta_1)^{1/2} + \delta_1\big]$.
If we take

$$\alpha = \frac{1}{B}\big[(A\delta_1)^{1/2} + \delta_1\big] \quad \text{and} \quad e = \frac{1}{B}\big[A^{1/2} + \delta_1^{1/2}\big]^2$$

then $\alpha < e$. Hence $\frac{1}{B}\big[A^{1/2} + \delta_1^{1/2}\big]^2$ is an upper bound for the minimum
value of e which is possible for a given H. This expression is still a function
of $\epsilon_1$ and $\epsilon_2$. The best possible upper bound which can be obtained in the
present framework is to minimize with respect to $\epsilon_1$ and $\epsilon_2$. This cannot be done
generally and in closed form.
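The elimination of the floating variable can be checked numerically. The sketch below is an illustration under assumed values: the numbers chosen for n, C, H and the epsilons and deltas are hypothetical, and the function names are ours. It evaluates the bound at the stated minimizing value and at two nearby values.

```python
import math

def bound(alpha, A, B, delta1):
    """alpha + A / (B - delta1/alpha); meaningful only when B*alpha > delta1."""
    return alpha + A / (B - delta1 / alpha)

def minimizing_alpha(A, B, delta1):
    """The stationary point ((A*delta1)^(1/2) + delta1) / B."""
    return (math.sqrt(A * delta1) + delta1) / B

def minimum_value(A, B, delta1):
    """The value (A^(1/2) + delta1^(1/2))^2 / B attained at the stationary point."""
    return (math.sqrt(A) + math.sqrt(delta1)) ** 2 / B

# Hypothetical parameter values, chosen only for illustration.
n, C, H, eps1, eps2 = 200, 0.5, 0.4, 0.02, 0.02
delta1, delta2 = 1e-3, 1e-3
A = 2.0 ** (-n * (C - H - eps1 - eps2))
B = 1.0 - delta2

alpha_star = minimizing_alpha(A, B, delta1)
e_bound = minimum_value(A, B, delta1)
print(alpha_star, e_bound, bound(alpha_star, A, B, delta1))
```

With these values the stationary point lies below the resulting bound on e, in agreement with the requirement that the floating variable be smaller than e.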

Let us remark, however, that at this point we cannot say anything
concerning a lower bound for e. In particular, the relation $\alpha < e$ is a condition
that is required only if we wish to make use of the framework herein
considered.


III. Let us consider a channel (i.e., (S, s), (R, r), p( ) and p( /s) where s
is a transmitted and r a received symbol) such that to each s there is an
r-set $A_s$ such that $p(A_s/s) \ge 1 - e$ and the $A_s$ are disjoint. For each r let
$P_e(r) = 1 - p(s_r/r)$ where $s_r$ is such that $p(s_r/r) \ge p(s/r)$ for all $s \ne s_r$. (Then
$P_e(r)$ is simply the probability that when r is received an error will be made
in identifying the symbol transmitted, assuming that whenever r is received
$s_r$ will be assumed to have been sent.) Now the inequality $\log a \le a - 1$ can be
used to show that

$$H(S/R) \le -P_e \log P_e - (1 - P_e) \log (1 - P_e) + P_e \log (N - 1)$$

where $P_e = \sum_R p(r)\, P_e(r)$ and N is the number of symbols in S.
R 


We now make use of the special properties of the channel considered. We
have

$$P_e = \sum_R p(r)\big(1 - p(s_r/r)\big) = 1 - \sum_R p(r)\, p(s_r/r)$$

$$= 1 - \sum_S \sum_{A_s} p(r)\, p(s_r/r) - \sum_{R - \sum_S A_s} p(r)\, p(s_r/r)$$

$$\le 1 - \sum_S \sum_{A_s} p(r)\, p(s/r) - \sum_{R - \sum_S A_s} p(r)\, p(s_0/r)$$

$$= 1 - \sum_{S - s_0} \sum_{A_s} p(r)\, p(s/r) - \sum_{R - \sum_{S - s_0} A_s} p(r)\, p(s_0/r)$$

$$= 1 - \sum_{S - s_0} p(s)\, p(A_s/s) - p(s_0)\, p\Big(R - \sum_{S - s_0} A_s \,\Big/\, s_0\Big)$$

$$\le 1 - \sum_{S - s_0} p(s)(1 - e) - p(s_0)(1 - e) = e, \quad \text{where } s_0 \text{ is any } s \text{ (8).}$$

Then $H(S/R) \le -e \log e - (1 - e) \log (1 - e) + e \log (N - 1)$, since for $e \le 1/2$ the right
side of the above inequality for H(S/R) is an increasing function of $P_e$. (We assume of
course $e \le 1/2$.)
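The bound on H(S/R) in terms of the error probability can be checked on any finite channel. The sketch below is an illustration only: it draws a hypothetical random joint distribution p(s, r), applies the decision rule described above (for each r, pick the s of largest p(s/r)), and compares the equivocation with the right-hand side of the inequality. All names are ours.

```python
import math
import random

def fano_check(joint):
    """joint[s][r] = p(s, r). Returns (H(S/R), P_e, bound), all in bits."""
    n_s, n_r = len(joint), len(joint[0])
    h_cond, p_e = 0.0, 0.0
    for r in range(n_r):
        p_r = sum(joint[s][r] for s in range(n_s))
        if p_r == 0.0:
            continue
        post = [joint[s][r] / p_r for s in range(n_s)]  # p(s/r)
        h_cond -= p_r * sum(q * math.log2(q) for q in post if q > 0.0)
        p_e += p_r * (1.0 - max(post))  # error when the likeliest s is chosen
    bound = p_e * math.log2(n_s - 1)
    for q in (p_e, 1.0 - p_e):
        if q > 0.0:
            bound -= q * math.log2(q)
    return h_cond, p_e, bound

rng = random.Random(0)
weights = [[rng.random() for _ in range(5)] for _ in range(4)]
total = sum(map(sum, weights))
joint = [[w / total for w in row] for row in weights]  # hypothetical p(s, r)
h_cond, p_e, bound = fano_check(joint)
print(h_cond, p_e, bound)
```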

Let us consider the elements $u_1, \ldots, u_N$ of some maximal set as the
fundamental symbols of a channel. Then regardless of what $p(u_i)$ is, $i = 1, \ldots, N$,
the channel is of the type considered above. Hence $P_e \le e$ (where e is as in
II) and

$$H(U/V) \le -e \log e - (1 - e) \log (1 - e) + e \log (N - 1)$$

Here H(U/V) represents the average amount of information lost per sequence
transmitted. The average amount lost per symbol is (1/n) H(U/V). Now for
$N = 2^{nH}$ and H < C, $e = e(n) \to 0$ as $n \to \infty$. Thus $(1/n) H(U/V) \to 0$ as $n \to \infty$. In
particular if we take $p(u_i) = 1/N$ then $(1/n)[H(U) - H(U/V)] \to H$ as $n \to \infty$.
(This is the proof mentioned in footnote 2.)

Actually, a much stronger result will be proven, namely, that for $N = 2^{nH}$,
H < C (and H fixed, of course) the equivocation per sequence, H(U/V), goes to
zero as $n \to \infty$. Since $\log (N - 1) < nH$, a sufficient condition that $H(U/V) \to 0$ as
$n \to \infty$ is that $e(n)\, n \to 0$ as $n \to \infty$.


We saw that $e \le \frac{1}{B}\big[A^{1/2} + \delta_1^{1/2}\big]^2$ where $B = 1 - \delta_2$ and $A = 2^{-n(C-H-\epsilon_1-\epsilon_2)}$.
Now if we take $\epsilon_1$, $\epsilon_2$ sufficiently small so that $C - H - \epsilon_1 - \epsilon_2 > 0$ and
$H(X) - H - \epsilon_2 > 0$, the behavior of $\delta_1$ as $n \to \infty$ is the only unknown factor
in the behavior of e. If the original X consists of only $x_1$, $x_2$, and Y consists
of only $y_1$, $y_2$, and if $p(x_1/y_2) = p(x_2/y_1)$, then log p(x/y) is only two-valued.
If we take €, = = €(n) as vanishing, for n ~o, faster dont n- , then a theorem 
on large deviations (9) is applicable and shows that $\delta_1$, and hence e, approaches
zero considerably faster than 1/n.
We omit the details inasmuch as a proof of the general case will be given
in section V.


IV. Up till now we have considered the set Y of received signals as having 
a finite number of elements y. One can, however, easily think of real situ- 
ations where this is not the case, and where the set Y is indeed nondenumer- 
able. Our terminology and notation will follow the supplement of (10). 

We define a channel by: 

1. the usual set X and a probability distribution p( ) over X

2. a set $\Omega$ of points $\omega$

3. a Borel field F of subsets A of $\Omega$

4. for each $x \in X$, a probability measure p( /x) on F.

We define the joint probability p(x, A) = p(x) p(A/x) and $p(A) = p(X, A) = \sum_X p(x, A)$.
Since $p(x, A) \le p(A)$ for any x, A, we have by the Radon-Nikodym
theorem

4.1  $p(x, A) = \int_A p(x/\omega)\, p(d\omega)$, where $p(x/\omega)$ may be taken as $\le 1$ for all x, $\omega$.

As the notation implies, $p(x/\omega)$ plays the role of a conditional probability.

We define $H(X) = -\sum_X p(x) \log p(x)$, as before. In analogy with the finite
case we define

4.2  $H(X/Y) = -\sum_X \int_\Omega \log p(x/\omega)\, p(x, d\omega)$

To show that the integral is finite, we see first, by section 4.1, that
$p(x, \{p(x/\omega) = 0\}) = 0$.
Furthermore, putting

$$A_i = \Big\{\frac{1}{2^{i+1}} < p(x/\omega) \le \frac{1}{2^i}\Big\}$$

we have, since $p(A_i) \le 1$, that

$$\int_{A_i} p(x/\omega)\, p(d\omega) \le \frac{1}{2^i}\, p(A_i) \le \frac{1}{2^i}$$

Hence

4.3  $p\Big(x, \Big\{\frac{1}{2^{i+1}} < p(x/\omega) \le \frac{1}{2^i}\Big\}\Big) \le \frac{1}{2^i}$

We therefore have

4.4  $-\int_\Omega \log p(x/\omega)\, p(x, d\omega) \le \sum_{i=0}^{\infty} \frac{i+1}{2^i} < \infty$

by the ratio test.
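As a small numerical aside, the series appearing in 4.4 can be summed directly. The sketch below is only an illustration (the function name is ours): it computes partial sums of (i+1)/2^i, which agree with the value 4 given by the standard power-series identity for the sum of (i+1) x^i at x = 1/2.

```python
def partial_sum(terms):
    """Sum of (i + 1) / 2**i for i = 0, ..., terms - 1."""
    return sum((i + 1) / 2 ** i for i in range(terms))

for terms in (5, 10, 20, 200):
    print(terms, partial_sum(terms))
```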

Everything we have done in sections I, II, and III can now be carried over
without change to the case defined above. A basic theorem in this connection
is that we can find a finite number of disjoint sets $A_j$, $\sum_j A_j = \Omega$, such that
$-\sum_{X, j} p(x, A_j) \log p(x/A_j)$ approximates H(X/Y) as closely as desired. Since
we make no use of it, we shall not prove it, though it follows easily from the
results given above and from standard integral approximation theorems.




V. We shall now show that e = e(n) goes to zero, as $n \to \infty$, faster than
1/n, which will complete the proof that the equivocation goes to zero
as the sequence length $n \to \infty$.

As previously mentioned, it is the behavior of $\delta_1$ of lemma 1 that we
must determine. The mathematical framework briefly is as follows.

We have the space $X \otimes \Omega$ of all pairs $(x, \omega)$ and a probability measure p( , )
on the measurable sets of $X \otimes \Omega$. We consider the infinite product space
$\prod_{i=1}^{\infty} \otimes (X \otimes \Omega)_i$ and the corresponding product measure
$\prod_{i=1}^{\infty} \otimes p_i(\ ,\ ) = p_\infty(\ ,\ )$.
Let us denote a "point" of $\prod_{i=1}^{\infty} \otimes (X \otimes \Omega)_i$ by $(x_\infty, \omega_\infty) = \{(x_1, \omega_1), (x_2, \omega_2), \ldots\}$.
We define an infinite set of random variables $\{Z_i\}$, $i = 1, 2, \ldots$, on
$\prod_{i=1}^{\infty} \otimes (X \otimes \Omega)_i$
by $Z_i(x_\infty, \omega_\infty) = -\log p(x_i/\omega_i)$, that is, $Z_i$ is a function only of the ith coordinate
of $(x_\infty, \omega_\infty)$. Clearly the $Z_i$ are independent and identically distributed; we
shall put $E(Z_1)$ for their mean value. From section 4.4 we know that the $Z_i$
have moments of the first order. (One can similarly show, using the fact that
$\sum_{i=0}^{\infty} (i+1)^n / 2^i < \infty$ for any n > 0,
that they have moments of all positive orders.)
Let $S_n = \sum_{i=1}^{n} Z_i$. Then the weak law of large numbers says that for
any $\epsilon_1, \delta_1$ there is an $n(\epsilon_1, \delta_1)$ such that for $n \ge n(\epsilon_1, \delta_1)$ the set of
points $(x_\infty, \omega_\infty)$ on which $|S_n/n - E(Z_1)| > \epsilon_1$ has $p_\infty(\ ,\ )$ measure less than $\delta_1$.
Now, in the notation of section II, $S_n(x_\infty, \omega_\infty) = -\log p(u/v)$ where $u = \{x_1, \ldots, x_n\}$
and $v = \{\omega_1, \ldots, \omega_n\}$, while $H(X/Y) = -\sum_X \int_\Omega \log p(x/\omega)\, p(x, d\omega) = E(Z_1)$. What we
have stated, then, is simply lemma 1.
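The statement of the weak law in this setting can be illustrated by simulation. The sketch below is an illustration under assumptions of our own: a binary symmetric channel with hypothetical crossover probability 0.1 and equally likely inputs, for which each Z_i takes one of two values and E(Z_1) is the binary entropy of the crossover probability. It estimates how often S_n/n deviates from E(Z_1) by more than a chosen tolerance for a short and a long sequence length.

```python
import math
import random

def sample_mean_z(n, flip, rng):
    """S_n / n with Z_i = -log2 p(x_i/y_i), binary symmetric channel, uniform input."""
    errors = sum(1 for _ in range(n) if rng.random() < flip)
    return (errors * -math.log2(flip) + (n - errors) * -math.log2(1 - flip)) / n

def deviation_frequency(n, flip, eps1, trials, rng):
    """Fraction of trials in which |S_n/n - E(Z_1)| exceeds eps1."""
    h = -flip * math.log2(flip) - (1 - flip) * math.log2(1 - flip)  # E(Z_1)
    bad = sum(1 for _ in range(trials) if abs(sample_mean_z(n, flip, rng) - h) > eps1)
    return bad / trials

rng = random.Random(1954)  # fixed seed so the run is repeatable
freq_small_n = deviation_frequency(50, 0.1, 0.1, 1000, rng)
freq_large_n = deviation_frequency(2000, 0.1, 0.1, 1000, rng)
print(freq_small_n, freq_large_n)
```

The observed frequencies play the role of the measure of the exceptional set in lemma 1; a simulation of this kind says nothing about the rate of decrease, which is the subject of the remainder of this section.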


Now, we are interested in obtaining an upper bound for

$$\text{Prob}\Big\{\frac{S_n}{n} - E(Z_1) \ge \epsilon_1\Big\}$$

More precisely, we shall find sequences $\epsilon_1(n)$ and $\delta_1(n)$ such that, as $n \to \infty$,
$\epsilon_1(n) \to 0$, $\delta_1(n) \to 0$ faster than 1/n, and $n(\epsilon_1(n), \delta_1(n)) = n$.
Let $Z_i^{(r)} = Z_i$ whenever $Z_i \le r$, and $Z_i^{(r)} = 0$ otherwise. By section 4.3,
$Z_i^{(r)}$ and $Z_i$ differ on a set of probability $\le 1/2^r$. Let $S_n^{(r)} = \sum_{i=1}^{n} Z_i^{(r)}$; then $S_n$
and $S_n^{(r)}$ differ on a set of probability $\le 1 - (1 - 2^{-r})^n \le n/2^r$. Furthermore

$$E(Z_1) - E(Z_1^{(r)}) \le \sum_{i=r}^{\infty} \frac{i+1}{2^i}$$

by the same argument which led to section 4.4. We thus have:

$$\text{Prob}\Big\{\frac{S_n}{n} - E(Z_1) \ge \epsilon_1(n)\Big\} \le \text{Prob}\Big\{\frac{S_n^{(r)}}{n} - E(Z_1) \ge \epsilon_1(n)\Big\} + \frac{n}{2^r}$$

$$\le \text{Prob}\Big\{\frac{S_n^{(r)}}{n} - E(Z_1^{(r)}) \ge \epsilon_1(n)\Big\} + \frac{n}{2^r},$$

since $E(Z_1) \ge E(Z_1^{(r)})$. In order to estimate $\text{Prob}\{S_n^{(r)}/n - E(Z_1^{(r)}) \ge \epsilon_1(n)\}$ we
use a theorem of Feller (11) which, for our purposes, may be stated as follows:


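THEOREM: Let $\{X_i\}$, $i = 1, \ldots, n$ be a set of independent, identically
distributed, bounded random variables. Let $S = \sum_{i=1}^{n} X_i$ and let

$$F(x) = \text{Prob}\{S - n E(X_1) \le x\}$$

Put $\sigma^2 = E([X_1 - E(X_1)]^2)$ and take $\lambda \ge \dfrac{\sup |X_1 - E(X_1)|}{\sigma n^{1/2}}$. Then if $0 < \lambda x < 1/12$
we have

$$1 - F(x \sigma n^{1/2}) = \exp[-\tfrac{1}{2} x^2 Q(x)]\, \big[\{1 - \Phi(x)\} + \theta \lambda \exp(-\tfrac{1}{2} x^2)\big]$$

where

$$|\theta| < 9, \quad |Q(x)| < \frac{1}{7} \cdot \frac{12 \lambda x}{1 - 12 \lambda x} \quad \text{and} \quad \Phi(x) = (2\pi)^{-1/2} \int_{-\infty}^{x} \exp[-y^2/2]\, dy$$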


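In order to apply this theorem, we take r = r(n). Now

$$\sigma(Z_1^{(r)}) = \Big[E\big([Z_1^{(r)} - E(Z_1^{(r)})]^2\big)\Big]^{1/2} \to \sigma(Z_1) \quad \text{as } r \to \infty$$

Hence for suitably large $n_0$, $\tfrac{3}{2}\sigma(Z_1) \ge \sigma(Z_1^{(r)}) \ge \tfrac{1}{2}\sigma(Z_1)$ for $n \ge n_0$. We can
now take $\lambda = \lambda(n) = \dfrac{2}{\sigma(Z_1)}\, n^{-1/2}\, r(n)$.

We henceforth consider $n \ge n_0$. We now have:

$$\text{Prob}\Big\{\frac{S_n^{(r)}}{n} - E(Z_1^{(r)}) \ge \epsilon_1(n)\Big\} = \text{Prob}\big\{S_n^{(r)} - n E(Z_1^{(r)}) \ge n \epsilon_1(n)\big\}$$

$$\le \text{Prob}\Big\{S_n^{(r)} - n E(Z_1^{(r)}) \ge \sigma(Z_1^{(r)})\, n^{1/2} \Big[n^{1/2}\, \frac{2 \epsilon_1}{3 \sigma(Z_1)}\Big]\Big\}$$

$$\le \exp\big[\tfrac{1}{2} x^2 |Q(x)|\big]\, \big[\{1 - \Phi(x)\} + 9 \lambda \exp(-\tfrac{1}{2} x^2)\big]$$

Using

$$1 - \Phi(x) \sim \frac{1}{(2\pi)^{1/2} x} \exp\Big(-\frac{x^2}{2}\Big)$$

or

$$1 - \Phi(x) \le \frac{1}{(2\pi)^{1/2} x} \exp\Big(-\frac{x^2}{2}\Big)$$

we may rewrite the above as

$$\le \exp\Big[-\frac{x^2}{2}\big(1 - |Q(x)|\big)\Big] \Big[\frac{1}{(2\pi)^{1/2} x} + 9\lambda\Big]$$

Now $\lambda = \lambda(n) = \dfrac{2}{\sigma(Z_1)}\, n^{-1/2}\, r(n)$ and $x = n^{1/2}\, \dfrac{2 \epsilon_1(n)}{3 \sigma(Z_1)}$, while

$$\delta_1(n) \le \exp\Big[-\frac{x^2}{2}\big(1 - |Q(x)|\big)\Big] \Big[\frac{1}{(2\pi)^{1/2} x} + 9\lambda\Big] + \frac{n}{2^{r(n)}}$$

It is now clear that we can pick $\epsilon_1(n)$ and r(n) so that $\lambda(n) \to 0$, $x = x(n) \to \infty$,
$\lambda x \to 0$ and $\delta_1(n) \to 0$ faster than 1/n.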
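The bound on the normal tail used in this estimate is easy to examine numerically. The sketch below is an illustration only; it relies on the standard relation between the normal distribution function and the complementary error function, and the function names are ours. It prints the ratio of the exact tail to the bound for a few values of x.

```python
import math

def normal_tail(x):
    """1 - Phi(x) for the standard normal distribution, via erfc."""
    return 0.5 * math.erfc(x / math.sqrt(2.0))

def tail_upper_bound(x):
    """exp(-x^2/2) / ((2*pi)^(1/2) * x), an upper bound on the tail for x > 0."""
    return math.exp(-0.5 * x * x) / (math.sqrt(2.0 * math.pi) * x)

xs = (1.0, 2.0, 4.0, 8.0)
ratios = [normal_tail(x) / tail_upper_bound(x) for x in xs]
for x, ratio in zip(xs, ratios):
    print(x, ratio)
```

The ratio stays below one and moves toward one as x grows, which is the sense in which the asymptotic relation and the inequality above agree.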

Let us point out that by using the approximation theorem of section IV and
thus having to deal with $-\log p(x/A_j)$, which is bounded, we can eliminate the
term $n/2^{r(n)}$. This makes it likely that Feller's theorem can be proven, in
our case, without the restriction that the random variables be bounded. There
is in fact a remark by Feller that the boundedness condition can be replaced by
the condition that $\text{Prob}\{|X_i| > n\}$ is a sufficiently rapidly decreasing function
of n. But any further discussion would take us too far afield.


VI. We have, up to this point, insisted that the set X of messages be finite. 
We wish to relax this condition now so that the preceding work can be applied
to the continuous channels considered by Shannon (1) and others. However,
any attempt to simply replace finite sums by denumerable sums or integrals
at once leads to serious difficulties. One can readily find simple examples
for which H(X), H(X/Y) and H(X) - H(X/Y) are all infinite. 

On the other hand, we may well ask what point there is in trying to work 
with infinite message ensembles. In any communication system there are 
always only a finite number of message symbols to be sent, that is, the
transmitter intends to send only a finite variety of message symbols. It is quite
true that, for example, an atrociously bad telegrapher, despite his intention
of sending a dot, dash, or pause, will actually transmit any one of an infinite 
variety of waveforms only a small number of which resemble intelligible sig- 
nals. But we can account for this by saying that the "channel" between the 
telegrapher's mind and hand is "noisy," and, what is more to the point, it is 
a simple matter to determine all the statistical properties that are relevant
to the capacity of this "channel." The channel whose message ensemble con- 
sists of the finite number of "intentions" of the telegrapher and whose received 
signal ensemble is an infinite set of waveforms resulting from the telegrapher's 
incompetence and noise in the wire is thus of the type considered in section IV. 

The case in which one is led to the consideration of so-called continuous 
channels is typified by the following example. In transmitting printed English 
via some teletype system one could represent each letter by a waveform, or 
each pair by a waveform, or every letter and certain pairs by a waveform, and 
so on. We have here an arbitrariness both in the number of message symbols
and in the waveforms by which they are to be represented. It is now clear that
we should extend the definition of a channel and its capacity in order to include
the case given above.
DEFINITION. Let X be a set of points x and $\Omega$ a set of points $\omega$. Let F be
a Borel field of subsets A of $\Omega$, and let p( /x) be, for each $x \in X$, a
probability measure on F. For each finite subset R of X the corresponding channel
and its capacity $C_R$ is well defined by section IV. The quantity $C = \text{l.u.b.}\ C_R$
over all finite subsets R of X will be called the capacity of the channel
$\{X, p(\ /x), \Omega\}$.

Now for any H < C there is a $C_R$ with $H < C_R \le C$, so that all our previous
results are immediately applicable.

We shall now show that the channel capacity defined above is, under suit- 
able restrictions, identical with that defined by Shannon (1). 

Let X be the whole real line, and $\Omega$, $\omega$, F, and A as usual. Let p(x) be
a continuous probability density over X and for each $A \in F$, let p(A/x) satisfy
a suitable continuity condition. (See the Appendix for this and subsequent
mathematical details.) Then $p(A) = \int_{-\infty}^{\infty} p(x)\, p(A/x)\, dx$ is a probability
measure. Since p(x, A) = p(x) p(A/x) is, for each x, absolutely continuous with
respect to p(A) we can define the Radon-Nikodym derivative $p(x/\omega)$ by
$p(x, A) = \int_A p(x/\omega)\, p(d\omega)$. Then, with the x-integral taken as improper, we
can define

$$C_p = \int_{-\infty}^{\infty} dx \int_\Omega p(x, d\omega) \log \frac{p(x/\omega)}{p(x)}$$

If we put $C_s = \text{l.u.b.}\ C_p$ over all continuous probability densities p(x), then
$C_s$ is Shannon's definition of the channel capacity. The demonstration of the
equivalence of C, as defined above, and $C_s$ is now essentially a matter of
approximating an integral by a finite sum, as follows:

If $C_s$ is finite, then we can find a $C_p$ arbitrarily close to $C_s$; if $C_s = +\infty$ we
can find $C_p$ arbitrarily large. We can further require that p(x) shall vanish
outside a suitably large interval, say [-A, A]. We can now find step-functions
g(x) defined over [-A, A] that approximate p(x) uniformly to any desired degree
of accuracy, and whose integral is 1. For such a step-function, $C_g$ is well
defined and approximates $C_p$ as closely as desired by suitable choice of g(x).
Let g(x) have n steps, with area $p_i$, and of course $\sum_{i=1}^{n} p_i = 1$. By suitably
choosing positive numbers $a_{ij}$, integers $N_i$, and points $x_{ij}$, with $x_{ij}$ lying in
the ith step of g(x) and $\sum_{j=1}^{N_i} a_{ij} = p_i$, we can approximate

$$p(A) = \int_{-A}^{A} g(x)\, p(A/x)\, dx \quad \text{by} \quad \sum_{i=1}^{n} \sum_{j=1}^{N_i} a_{ij}\, p(A/x_{ij})$$

and hence $C_g$ by $C_R$, where $R = \{x_{ij}\}$. Thus $C \ge C_s$. On the other hand, let
$R = \{x_i\}$, not as taken above. Let $p(x_i)$ be such that $H(X) - H(X/Y) = C_R$. Then
the singular function $\sum_i p(x_i)\, \delta(x - x_i)$, where $\delta(\ )$ is the Dirac delta-function,
can be approximated by continuous probability densities p(x) such that $C_p$
approximates $C_R$. Hence $C_s \ge C$, or $C = C_s$.
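The replacement of the integral for p(A) by a finite sum can be sketched numerically. The example below is an illustration under assumptions of our own: the channel adds unit Gaussian noise, the set A is the half-line from zero upward, the step-function has two hypothetical steps, and the points within each step are equally weighted midpoints. None of these choices comes from the text; they serve only to show the two approximations agreeing as the number of points grows.

```python
import math

def p_A_given_x(x):
    """Hypothetical p(A/x): chance that x plus unit Gaussian noise lands in A = [0, inf)."""
    return 0.5 * math.erfc(-x / math.sqrt(2.0))

def finite_sum_approximation(steps, points_per_step):
    """Approximate the integral of g(x) p(A/x) dx by the double sum of a_ij p(A/x_ij).

    steps: list of (left, right, area) describing the step-function g(x).
    Each step receives equally weighted points x_ij at sub-interval midpoints.
    """
    total = 0.0
    for left, right, area in steps:
        width = (right - left) / points_per_step
        a_ij = area / points_per_step  # the a_ij of one step sum to its area p_i
        for j in range(points_per_step):
            x_ij = left + (j + 0.5) * width
            total += a_ij * p_A_given_x(x_ij)
    return total

steps = [(-1.0, 0.0, 0.3), (0.0, 2.0, 0.7)]  # hypothetical g(x); the areas sum to 1
coarse = finite_sum_approximation(steps, 4)
fine = finite_sum_approximation(steps, 400)
print(coarse, fine)
```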




This can clearly be generalized to the case in which X is n-dimensional 
Euclidean space. 


VII. We now wish to relax the condition of independence between successive
transmitted symbols. Our definitions will be those of Shannon, as generalized 
by McMillan, whose paper (1) we now follow. 

By an alphabet we mean a finite abstract set. Let A be an alphabet and I
the set of all integers, positive, zero, and negative. Denote by $A^I$ the set of
all sequences $x = (\ldots, x_{-1}, x_0, x_1, \ldots)$, $x_t \in A$, $t \in I$.

A cylinder set in $A^I$ is a subset of $A^I$ defined by specifying an integer $n \ge 1$,
a finite sequence $\alpha_0, \ldots, \alpha_{n-1}$ of letters of A, and an integer t. The cylinder
set corresponding to n, $\alpha_0, \ldots, \alpha_{n-1}$, and t is $\{x \in A^I \mid x_{t+k} = \alpha_k,\ k = 0, 1, \ldots, n-1\}$.
We denote by $F_A$ the Borel field generated by the cylinder sets.

An information source $[A, \mu]$ consists of an alphabet A and a probability
measure $\mu$ defined on $F_A$. Let T be defined by $T(\ldots, x_{-1}, x_0, x_1, \ldots) =
(\ldots, x'_{-1}, x'_0, x'_1, \ldots)$ where $x'_t = x_{t+1}$. Then $[A, \mu]$ will be called stationary if,
for $S \in F_A$, $\mu(S) = \mu(TS)$ (clearly T preserves measurability) and will be called
ergodic if it is stationary and S = TS implies that $\mu(S) = 1$ or 0.

By a channel we mean the system consisting of:

1. a finite alphabet A and an abstract space B.

2. a Borel field of subsets of B, designated by $\beta$, with $B \in \beta$.

3. the Borel field of subsets of $B^I = \prod_{-\infty}^{\infty} \otimes B_i$ (where $B_i = B$) which we define
in the usual way, $\prod_{-\infty}^{\infty} \otimes \beta$, and designate $F_B$.

4. a function $\nu_x$ which is, for each $x \in A^I$, a probability measure on $F_B$,
and which has the property that if $x^1_t = x^2_t$ for $t \le n$, then $\nu_{x^1}(S) = \nu_{x^2}(S)$
for any $S \in F_B$ of the form $S = S_1 \otimes S_2$, where $S_1 \in \prod_{-\infty}^{n} \otimes \beta$ and
$S_2 = \prod_{n+1}^{\infty} \otimes B$.


Consider a stationary channel whose input A is a stationary source $[A, \mu]$.
Let $C^I = A^I \otimes B^I$ and $F_C = F_A \otimes F_B$. We can define a probability measure on
$F_C$ by $p(R, S) = p(R \otimes S) = \int_R \nu_x(S)\, d\mu(x)$ for $R \in F_A$, $S \in F_B$, assuming certain
measurability conditions for $\nu_x(S)$. It is then possible to define the information
rate of the channel source, the equivocation of the channel, and the channel
capacity in a manner analogous to that of section I. Assuming that $\mu(\ )$ and
p( , ) are ergodic, McMillan proves lemma 1 of section I in this more
general framework. Hence the proof of section II remains completely valid,
except for the demonstration that the theorem cannot hold for H > C.

The difficulty that we wish to discuss arises in the interpretation of p( /u).
A glance at McMillan's definitions shows that p(B/u) no longer can be
interpreted as "the probability of receiving a sequence lying in B, given that the
sequence u was sent." This direct causal interpretation is valid only for
$\nu_x(\ )$. But the result of the theorem of section II is the existence of a set $\{u_i\}$
and disjoint sets $B_i$ such that $p(B_i/u_i) > 1 - e$. Under what conditions can we
derive from this an analogous statement for $\nu_{u_i}(B_i)$?
Suppose that for a given integer N we are given, for each sequence
$x_1, \ldots, x_{N+1}$ of message symbols, a probability measure $\nu(\ /x_1, \ldots, x_{N+1})$
on the Borel field $\beta$ of received signals (not sequences of signals). We
envisage here the situation in which the received signal depends not only upon the
transmitted symbol $x_{N+1}$ but also upon the preceding N symbols which were
transmitted.
If $u = \{x_1, \ldots, x_n\}$, then

$$p(\ /u) = \sum_{[x_{-N+1}, \ldots, x_0]} \frac{p(x_{-N+1}, \ldots, x_n)}{p(x_1, \ldots, x_n)} \times \big[\nu(\ /x_{-N+1}, \ldots, x_1) \otimes \cdots \otimes \nu(\ /x_{n-N}, \ldots, x_n)\big]$$

Let us write the bracket term, which is a probability measure on received
sequences of length n, as $\nu_n(\ /x_{-N+1}, \ldots, x_n)$. Now if $p(B_i/u_i) > 1 - e$, then,
since

$$\sum_{[x_{-N+1}, \ldots, x_0]} \frac{p(x_{-N+1}, \ldots, x_n)}{p(x_1, \ldots, x_n)} = 1$$

there must be at least one sequence $x_{-N+1}, \ldots, x_0, x_1, \ldots, x_n$ for which

$$\nu_n(B_i/x_{-N+1}, \ldots, x_n) > 1 - e$$

A minor point still remains: we had $2^{nH}$ sequences $u_i$ and we now have the
same number of sequences, but of length n + N. In other words, we are
transmitting at a rate H' = (n/(n+N)) H. But since N is fixed we can make H' as near
as we choose to H by taking n sufficiently large; hence we can still transmit
at a rate as close as desired to the channel capacity.
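The loss of rate caused by the N extra symbols is simple to quantify. The sketch below is an illustration with hypothetical numbers and function names of our own: it computes the diluted rate H' = (n/(n+N)) H and a sequence length n that brings H' within a prescribed gap of H.

```python
import math

def effective_rate(n, memory, rate):
    """H' = (n / (n + N)) * H for sequences of length n preceded by N symbols."""
    return n / (n + memory) * rate

def sufficient_block_length(memory, rate, gap):
    """A length n for which H - H' <= gap, from H*N/(n+N) <= gap."""
    return max(1, math.ceil(memory * (rate - gap) / gap))

memory, rate, gap = 3, 0.5, 0.01  # hypothetical N, H and tolerance
n = sufficient_block_length(memory, rate, gap)
print(n, effective_rate(n, memory, rate))
```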

It is evident that by imposing suitable restrictions on $\nu_x(\ )$ we can do the
same sort of thing in a more general context. These restrictions would 
amount to saying that the channel characteristics are sufficiently insensitive 
to the remote past history of the channel. 

In this connection some interesting mathematical questions arise. If we 
define the capacity following McMillan for the $\nu(\ /x_1, \ldots, x_{N+1})$ as above, is
the capacity actually achieved? It seems reasonable that it is, and that the 
channel source that attains the capacity will automatically be of the mixing 
type (see ref. 12, p. 36, Def. 11.1; also p. 57) and hence ergodic. Because 
of the special form of $\nu_x(\ )$ it easily follows that the joint probability measure
would likewise be of mixing type and hence ergodic. 

The question of whether or not the equivocation vanishes in this more gen- 
eral setup is also unsettled. Presumably one might be able to extend Feller's 
theorem to the case of nonindependent random variables that approach indepen- 
dence, or perhaps actually attain independence when far enough apart. To my 
knowledge nothing of this sort appears in the literature. 

Finally there is the question of whether or not, in the more general cases, 
the assertion that for H > C the main theorem cannot hold is still true. While 
this seems likely, at least in the case of a channel with finite memory, it is
to my knowledge unproven.




APPENDIX 


It is our purpose here to supply various proofs that were omitted in the 
body of the work. 
1. H(X) - H(X/Y) is a continuous function of the $p(x_i)$, $i = 1, \ldots, a$.
PROOF. H(X) is clearly continuous. To show the same for H(X/Y) we need
only show that for each i, $-p(x_i) \int_\Omega \log p(x_i/\omega)\, p(d\omega/x_i)$ is a continuous
function of $p(x_1), \ldots, p(x_a)$. Now

$$p(x_i/\omega) = \frac{p(x_i, d\omega)}{p(d\omega)} = p(x_i)\, \frac{p(d\omega/x_i)}{p(d\omega)}$$

But since $\sum_k p(A/x_k) \ge p(A)$, we have (see ref. 13, p. 133)

$$\frac{p(d\omega/x_i)}{\sum_k p(d\omega/x_k)} = \frac{p(d\omega/x_i)}{p(d\omega)} \cdot \frac{p(d\omega)}{\sum_k p(d\omega/x_k)} = \frac{p(d\omega/x_i)}{p(d\omega)} \cdot \frac{\sum_k p(x_k)\, p(d\omega/x_k)}{\sum_k p(d\omega/x_k)}$$

almost everywhere with respect to $\sum_k p(\ /x_k)$ and hence, certainly, almost
everywhere with respect to each $p(\ /x_i)$. Thus

$$\frac{p(d\omega/x_i)}{p(d\omega)} = \frac{p(d\omega/x_i)}{\sum_k p(d\omega/x_k)} \Big/ \frac{\sum_k p(x_k)\, p(d\omega/x_k)}{\sum_k p(d\omega/x_k)}$$

almost everywhere with respect to p( ). The dependence on the $p(x_k)$ is now
explicitly continuous, so that each $p(x_i/\omega)$ is a continuous function of
$p(x_1), \ldots, p(x_a)$ almost everywhere with respect to each $p(\ /x_i)$. We now wish
to show that $-p(x_i) \int_\Omega \log p(x_i/\omega)\, p(d\omega/x_i)$ is a continuous function of the $p(x_k)$.
To this end let $\{p_j(x_1), \ldots, p_j(x_a)\}$, $j = 1, 2, \ldots$, be a convergent sequence of
points in a-dimensional Euclidean space $R_a$ with limit $\{p_0(x_1), \ldots, p_0(x_a)\}$.
Then we have $\lim_{j \to \infty} p_j(x_i/\omega) = p_0(x_i/\omega)$ almost everywhere with respect to each
$p(\ /x_i)$. We must now show that

$$-p_j(x_i) \int_\Omega \log p_j(x_i/\omega)\, p(d\omega/x_i) \to -p_0(x_i) \int_\Omega \log p_0(x_i/\omega)\, p(d\omega/x_i).$$


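Suppose, first, that $p_0(x_i) \ne 0$. Now from section IV we have

$$\int_\Omega -\log \Big\{\frac{p(d\omega/x_i)}{\sum_k p(d\omega/x_k)} \Big/ \frac{\sum_k p(x_k)\, p(d\omega/x_k)}{\sum_k p(d\omega/x_k)}\Big\}\, p(d\omega/x_i) < \infty$$

whenever $p(x_i) \ne 0$. Take $p(x_1) = p(x_2) = \cdots = p(x_a) = 1/a$. Then:

$$\int_\Omega -\log \Big\{a\, \frac{p(d\omega/x_i)}{\sum_k p(d\omega/x_k)}\Big\}\, p(d\omega/x_i) < \infty \quad \text{or clearly} \quad \int_\Omega -\log \frac{p(d\omega/x_i)}{\sum_k p(d\omega/x_k)}\, p(d\omega/x_i) < \infty.$$

But

$$-\log \Big\{\frac{p(d\omega/x_i)}{\sum_k p(d\omega/x_k)} \Big/ \frac{\sum_k p_j(x_k)\, p(d\omega/x_k)}{\sum_k p(d\omega/x_k)}\Big\} \le -\log \frac{p(d\omega/x_i)}{\sum_k p(d\omega/x_k)}$$

Since the last term is also bounded below by $\log p_j(x_i)$, then by reference 14,
p. 110, we have

$$\int_\Omega -\log \Big\{\frac{p(d\omega/x_i)}{\sum_k p(d\omega/x_k)} \Big/ \frac{\sum_k p_j(x_k)\, p(d\omega/x_k)}{\sum_k p(d\omega/x_k)}\Big\}\, p(d\omega/x_i) \to \int_\Omega -\log \Big\{\frac{p(d\omega/x_i)}{\sum_k p(d\omega/x_k)} \Big/ \frac{\sum_k p_0(x_k)\, p(d\omega/x_k)}{\sum_k p(d\omega/x_k)}\Big\}\, p(d\omega/x_i)$$

Since $p_0(x_i) \ne 0$,

$$-p_j(x_i) \int_\Omega \log p_j(x_i/\omega)\, p(d\omega/x_i) = p_j(x_i) \int_\Omega -\log \Big\{p_j(x_i)\, \frac{p(d\omega/x_i)}{\sum_k p(d\omega/x_k)} \Big/ \frac{\sum_k p_j(x_k)\, p(d\omega/x_k)}{\sum_k p(d\omega/x_k)}\Big\}\, p(d\omega/x_i)$$

$$\to p_0(x_i) \int_\Omega -\log \Big\{p_0(x_i)\, \frac{p(d\omega/x_i)}{\sum_k p(d\omega/x_k)} \Big/ \frac{\sum_k p_0(x_k)\, p(d\omega/x_k)}{\sum_k p(d\omega/x_k)}\Big\}\, p(d\omega/x_i) = -p_0(x_i) \int_\Omega \log p_0(x_i/\omega)\, p(d\omega/x_i)$$

If $p_0(x_i) = 0$, we can clearly assume $p_j(x_i) \ne 0$, since we have to show that
$-p_j(x_i) \int_\Omega \log p_j(x_i/\omega)\, p(d\omega/x_i) \to 0$. As before we have

$$-\log \Big\{\frac{p(d\omega/x_i)}{\sum_k p(d\omega/x_k)} \Big/ \frac{\sum_k p_j(x_k)\, p(d\omega/x_k)}{\sum_k p(d\omega/x_k)}\Big\} \le -\log \frac{p(d\omega/x_i)}{\sum_k p(d\omega/x_k)};$$

therefore

$$-p_j(x_i) \int_\Omega \log p_j(x_i/\omega)\, p(d\omega/x_i) \le p_j(x_i) \int_\Omega -\log \frac{p(d\omega/x_i)}{\sum_k p(d\omega/x_k)}\, p(d\omega/x_i) + p_j(x_i) \log \frac{1}{p_j(x_i)} \to 0$$

as $p_j(x_i) \to 0$ (i.e., as $j \to \infty$).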

2. We wish here to rigorize the discussion of section VI.
We assume that p( /x) satisfies the following continuity condition: For any
finite closed interval I and any $\epsilon$ there is a $\delta(I, \epsilon)$ such that

$$\Big|\frac{p(A/x_1)}{p(A/x_2)} - 1\Big| < \epsilon \quad \text{for } |x_1 - x_2| < \delta \text{ and } x_1, x_2 \in I,$$

whenever $p(A/x_2) \ne 0$. It follows that if, for $x_1 \in I$, $p(A/x_1) = 0$, then for
$x_2 \in I$ and $|x_2 - x_1| < \delta$, $p(A/x_2) = 0$. (Hence, since $\{x \mid p(A/x) = 0\}$ is
evidently both open and closed, for any A, p(A/x) either vanishes everywhere
or nowhere.) That $p(A) = \int_{-\infty}^{\infty} p(x)\, p(A/x)\, dx$ is a probability measure is
a simple consequence of reference 14, p. 112, Theorem B. Since p(x) p(A/x)
is continuous, p(A) can vanish only if p(x) p(A/x) is zero for all x. Hence,
for all x, p(x) p(A/x) is absolutely continuous with respect to p(A).

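We can sharpen this result as follows: Let I be a closed interval over
which $p(x) \ne 0$. Then for a given $\epsilon$ we can find a $\delta$ such that $p(x_1) > p(x_2)/2$
and $p(A/x_1) > (1 - \epsilon)\, p(A/x_2)$, for $x_1, x_2 \in I$ and $|x_1 - x_2| < \delta$. We thus have

$$p(A) = \int_{-\infty}^{\infty} p(x)\, p(A/x)\, dx > 2\delta\, \frac{p(x_0)}{2}\, p(A/x_0)(1 - \epsilon) = \delta\, p(x_0)(1 - \epsilon)\, p(A/x_0)$$

Thus for any $x_0 \in I$,

$$p(x_0)\, p(A/x_0) < \frac{1}{(1 - \epsilon)\delta}\, p(A) = k(x_0)\, p(A)$$

which defines $k(x_0) < \infty$. As in section IV, we can easily show that
$-\infty < \int_\Omega \log p(x/\omega)\, p(x, d\omega) < \infty$ for all x. Now

$$\int_\Omega p(x, d\omega) \log \frac{p(x)}{p(x/\omega)} \le \int_\Omega p(x, d\omega) \Big[\frac{p(x)}{p(x/\omega)} - 1\Big] \log e = \Big[\int_\Omega p(x/\omega)\, p(d\omega)\, \frac{p(x)}{p(x/\omega)} - p(x)\Big] \log e = 0,$$

the next to last equality being justified by reference 14, p. 133. Therefore,
if $\int_\Omega p(x, d\omega) \log \dfrac{p(x/\omega)}{p(x)}$ is, say, continuous in x, then

$$\int_{-\infty}^{\infty} dx \int_\Omega p(x, d\omega) \log \frac{p(x/\omega)}{p(x)}$$

is meaningful and is either positive or equal to $+\infty$.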
We shall now show that $\int_\Omega p(x, d\omega) \log \dfrac{p(x/\omega)}{p(x)}$ is indeed continuous. To
this end let $\{x_i\}$ be a convergent sequence of real numbers with limit $x_0$. We
shall show that $p(x_i/\omega) \to p(x_0/\omega)$ almost everywhere with respect to $p(\ /x_0)$.
(Since for $p(x_0) = 0$ this assertion is trivially true, we assume that $p(x_0) \ne 0$.)
Let $A^+_{in} = \{p(x_i/\omega) - p(x_0/\omega) > 1/n\}$ and $A^-_{in} = \{p(x_i/\omega) - p(x_0/\omega) < -1/n\}$. Now

$$p(x_i)\, p(A^+_{in}/x_i) - p(x_0)\, p(A^+_{in}/x_0) = \int_{A^+_{in}} \big(p(x_i/\omega) - p(x_0/\omega)\big)\, p(d\omega) > \frac{1}{n}\, p(A^+_{in}).$$

There is clearly no loss in generality in assuming $p(x_i) \ne 0$. Then

$$p(A^+_{in}/x_i) - p(A^+_{in}/x_0) > \frac{1}{k(x_0)\, n}\, \frac{p(x_0)}{p(x_i)}\, p(A^+_{in}/x_0) + \frac{p(x_0) - p(x_i)}{p(x_i)}\, p(A^+_{in}/x_0).$$

Now $\dfrac{p(x_0)}{k(x_0)\, n\, p(x_i)} + \dfrac{p(x_0) - p(x_i)}{p(x_i)}$ is positive and bounded away from zero for all
i sufficiently large. By the continuity condition on p( /x) we therefore have
$p(A^+_{in}/x_0) = 0$ for i > i(n) suitably chosen. We get a similar result for
$p(A^-_{in}/x_0)$. Let $A^+_n$ be the set of points $\omega$ which lie in infinitely many $A^+_{in}$, and
similarly for $A^-_n$. Then $p(A^+_n/x_0) = 0$, and so,

$$p\Big(\sum_n A^+_n + \sum_n A^-_n \,\Big/\, x_0\Big) = p\Big(\sum_n (A^+_n + A^-_n) \,\Big/\, x_0\Big) = 0$$

But for any $\omega \in \Omega - \sum_n (A^+_n + A^-_n)$, $p(x_i/\omega) \to p(x_0/\omega)$, which was to be shown.
n n 
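As before, let $\{x_i\}$ be a convergent sequence with limit $x_0$.

a. Let us assume first that $p(x_0) \ne 0$. Now

$$\Big|\int_\Omega -\log p(x_0/\omega)\, p(d\omega/x_0) - \int_\Omega -\log p(x_i/\omega)\, p(d\omega/x_i)\Big|$$

$$= \Big|\int_\Omega [-\log p(x_0/\omega) + \log p(x_i/\omega)]\, p(d\omega/x_0) + \int_\Omega -\log p(x_i/\omega)\, p(d\omega/x_0) - \int_\Omega -\log p(x_i/\omega)\, p(d\omega/x_i)\Big|$$

$$\le \Big|\int_\Omega [-\log p(x_0/\omega) + \log p(x_i/\omega)]\, p(d\omega/x_0)\Big| + \Big|\int_\Omega -\log p(x_i/\omega)\, p(d\omega/x_0) - \int_\Omega -\log p(x_i/\omega)\, p(d\omega/x_i)\Big|$$

To show that the first of the last two terms goes to zero, we remark, first,
that since $p(x_0) \ne 0$ and $p(A/x_0) \le (1 + \alpha)\, p(A/x_i)$ for any A, for i suitably large,
it follows, as in section IV, that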


I ies noe a p(x; /w) p(dw/x,) 

is uniformly bounded for all i, where we use the previously shown result that
$p(x)\, p(A/x) \le k(x)\, p(A) \le M p(A)$ for M suitably chosen and x in a closed
interval containing $x_0$. It is now a simple exercise, by using reference 14, p. 110,
to justify the interchange of limit and integration, so that the term in question
vanishes as $i \to \infty$. The relation $p(\ /x_0) \le (1 + \alpha)\, p(\ /x_i) \le (1 + \alpha)^2\, p(\ /x_0)$ with
$\alpha \to 0$ as $i \to \infty$ at once shows that the second term likewise vanishes as $i \to \infty$.


b. Now suppose that $p(x_0) = 0$. Then by definition we take

$$\int_\Omega -\log p(x_0/\omega)\, p(x_0, d\omega) = 0$$

If p(x) is identically zero in some neighborhood of $x_0$, there is nothing to be
proven. We can then assume that $p(x_i) \ne 0$. For a closed interval containing
$x_0$ and the $x_i$, we have, for $|x_i - x_j|$ sufficiently small (or equivalently, i and
j sufficiently large) that $p(\ /x_i) \ge (1 - \epsilon)\, p(\ /x_j)$. Thus

$$p(x_i/\omega) \ge p(x_i)\, \frac{p(d\omega/x_j)}{p(d\omega)}\, (1 - \epsilon).$$

Hence

$$-\log p(x_i/\omega) \le -\log [p(x_i)(1 - \epsilon)] - \log \frac{p(d\omega/x_j)}{p(d\omega)}$$

for fixed j and any i, both sufficiently large. Further, since $p(x_j) \ne 0$,
$p(x_j)\, p(A/x_j) \le M p(A)$ for suitable M. Hence

$$p(x_i)\, p(A/x_i) \le \frac{1}{1 - \epsilon}\, \frac{p(x_i)}{p(x_j)}\, M p(A)$$

Since $p(x_i) \to 0$, we have, for sufficiently large i, $p(x_i)\, p(A/x_i) \le p(A)$, so that
$p(x_i/\omega) \le 1$ or $-\log p(x_i/\omega) \ge 0$. Therefore

$$0 \le \int_\Omega -\log p(x_i/\omega)\, p(x_i, d\omega) \le -p(x_i) \log [p(x_i)(1 - \epsilon)] + p(x_i) \int_\Omega -\log \frac{p(d\omega/x_j)}{p(d\omega)}\, p(d\omega/x_i)$$

As i approaches $\infty$, the last integral approaches

$$\int_\Omega -\log \frac{p(d\omega/x_j)}{p(d\omega)}\, p(d\omega/x_0), \quad \text{which is } < \infty,$$

using arguments as in section IV. Since $p(x_i) \to 0$, we have, finally,

$$\int_\Omega -\log p(x_i/\omega)\, p(x_i, d\omega) \to 0 \quad \text{as } i \to \infty.$$


References and Footnotes 


1. C. E. Shannon, A mathematical theory of communication, Bell System
Tech. J. 27, 379-423, 623-656; also B. McMillan, The basic theorems
of information theory, Ann. Math. Stat. 24, 196-219.

2. That it is indeed sufficient will be shown in section III.

3. R. M. Fano, Lecture notes on statistical theory of information,
Massachusetts Institute of Technology, spring, 1952. This statement
asserts that if the channel is considered as transmitting sequence by
sequence its capacity per symbol is still bounded by C. Using the fact
that the reception rate per symbol may be written as

$$\frac{H(V) - H(V/U)}{n}$$

the statement follows upon noticing that H(V/U) depends only upon single-
received-symbol probabilities and that H(V) is a maximum when those
probabilities are independent. The expression H(V) - H(V/U) then reduces
to a sum of single-symbol channel rates, from which the assertion follows
at once.

4. It is not difficult to see that H(X) - H(X/Y) is a continuous function of the
"variables" $r_i = p(x_i)$, i = 1, ..., a. This is true also in the context of
section IV (cf. Appendix). Since the set of points in a-dimensional
cartesian space $R^a$ defined by $r_i \ge 0$ and $\sum_{i=1}^{a} r_i = 1$ is a closed set,
H(X) - H(X/Y) attains a maximum value. This point is, however, not
critical, for, given H < C we can certainly find p( ) such that
$H < H(X) - H(X/Y) \le C$ and then use H(X) - H(X/Y) in place of C.

5. This condition appears to be superfluous. It is, however, strongly 
indicated by the immediately preceding result and is, in fact, essential 
for the proof. 


6. E. N. Gilbert, A comparison of signalling alphabets, Bell System Tech.
J. 31, in particular p. 506.




7. Up to here, the possibility that certain quantities are not integers can be
seen not to invalidate any of the various inequalities. In what follows, the 
modifications needed to account for this possibility are obvious and insig- 
nificant and are therefore omitted. 


8. Word-wise, this string of inequalities states simply: (a) that in order to
minimize the probability of misidentifying the transmitted s we should 
guess the s with greatest conditional probability as the one actually trans- 
mitted; (b) if instead of the above recipe, we assume that s was sent, 
whenever r€A_ is received, for all s except s_, and that in all other 
circumstances we shall assume s_ to have been sent, then the probability 
of error is less than e; (c) hence, since P. is the error obtained by the 
best method of guessing, Ee ‘er s 


9. See reference 13, pp. 144-5. This was pointed out by Professor R. M. Fano.

10. J. L. Doob, Stochastic Processes (John Wiley and Sons, Inc., New York, 1953).

11. W. Feller, Generalization of a probability limit theorem of Cramer,
Trans. Amer. Math. Soc. 54, 361-372 (1943).

12. E. Hopf, Ergodentheorie (Julius Springer, Berlin, 1937).

13. W. Feller, An Introduction to Probability Theory (John Wiley and Sons,
Inc., New York, 1950).

14. P. R. Halmos, Measure Theory (D. Van Nostrand, New York, 1950).




BINARY CODING 


Marcel J. E. Golay 
Signal Corps Engineering Laboratories 
Fort Monmouth, New Jersey 


INTRODUCTION 


The upper bound given by Shannon¹ to the transmission capacity of a noisy, discrete channel has
challenged mathematicians to devise digital error-correcting codes or coding systems
which approach this upper bound as closely as possible.


This mathematical effort has been concentrated on the binary system, with the aim of devising
codes which are as efficient as possible in the following sense: given an upper limit e to the number of
errors during the transmission and reception of a block or message of n binits, (a) all messages are
received in all cases without equivocation; (b) the number of transmittable messages approaches as
closely as possible the value

$$\frac{2^n}{\sum_{k=0}^{e} \frac{n!}{(n-k)!\,k!}},$$

the sum in the denominator being the sum of the first (e + 1) numbers of the nth line of Pascal's
triangle, and representing the number of ways in which any one transmitted message can be received
when transmission errors in any number from zero to e can occur.


Codes in which the upper limit is exactly reached will be termed lossless codes in what follows.
It may be worth noting here the paradoxical circumstance that, while the existence proof for codes
approaching Shannon's upper bound indefinitely closely was based on the assumption of codes consisting
of random messages, the search for efficient or lossless codes has been successful to the extent that
codes were discovered which were characterized by deeply seated, entwined symmetries.


It is the purpose of this discussion to explore certain aspects of this circumstance, and to 
describe some group-theoretical approaches to coding problems. 


The first example of a symbol correcting code was given by Shannon,² who quotes Hamming's lossless
coding of a seven-binit message, none or one of which can be received in error. This case was extended
by the writer to blocks of $2^n - 1$ binary symbols and, more generally, to blocks of $(p^n - 1)/(p - 1)$
p-nary symbols (p prime), none or one of which can be received in error.³ With the exception of the trivial
cases of (2n + 1)-binit messages, up to n of which can be received in error, and of two special cases
treated in the last paper cited, these are the only cases of lossless symbol coding known, and the
possibility must be considered that others do not exist. Their impossibility will be demonstrated
below for the case of lossless 2-error correcting symbol codes, and it will also be shown that the
search for lossless e-error correcting symbol codes need only be a finite one, because lossless symbol
correcting codes become impossible beyond a determinable message length for any one selected value of e.


These results will leave open the question of whether lossless message codes correcting two or more
errors exist (outside of the one mentioned above), because message codes form a more
general class of codes, of which symbol codes are only a sub-class. Various examples of
message codes which are more efficient than symbol codes will be cited, and their mode of formation
illustrated. It is this mode of formation which is suggestive of the kind of group-theoretical
approach which the writer believes to be the most promising for the class of coding problems considered.


Symbol Correcting Codes 


When a symbol correcting digital code exists for the transmission of n-binit messages, up to e
of which can be received in error, and i of which (the $Y_k$'s) carry the message while the remaining
j = n - i binits (the $X_m$'s) are redundant and are provided to remove the equivocation, the transmitted
binits are related by the matrix:

$$E_m \equiv X_m + \sum_{k=1}^{i} a_{mk} Y_k \equiv 0 \pmod 2, \qquad m = 1, 2, \ldots, j$$

and the essential property of this matrix is that the $E_m$'s recalculated from the partially erroneously
received $X_m$'s and $Y_k$'s form a j-binit number E, which will be termed the corrector, and which
determines univocally which symbols were received in error.


¹ Bell System Technical Journal, July, 1948.

² Loc. cit., p. 418.

³ Marcel J. E. Golay, "Notes on Digital Coding," Proc. I.R.E., vol. 37, p. 657; 1949.


When the code is lossless, a first condition must obtain, which stipulates that all possible cases
of up to e errors are represented by all the possible values of the corrector:

$$\sum_{k=0}^{e} \frac{n!}{(n-k)!\,k!} = 2^j \tag{1}$$

Another condition can be obtained as follows:

Whenever all but one $Y_k$ and all X's received are zero, the binits of the corrector E(k) consist
of the series of $a_{mk}$ values for the particular k considered, and will be termed the characteristic of
$Y_k$.

Whenever all but one $X_m$ and all $Y_k$'s are zero, the binits of the corrector E(m) consist of zeroes
with a single one corresponding to the particular m considered, and will be called the characteristic
of $X_m$. In general, the corrector E consists of the j-binit number formed by adding modulo 2 the
corresponding binits of the characteristics of the symbols received in error. A general condition for
a lossless code is that all possible (Boolean) additions thus made of up to e characteristics
reproduce exactly, and only once, each of the $2^j$ possible values of the corrector.


If the parity of the characteristics or of the corrector is defined as zero when the number of
ones in these numbers is even, and as one otherwise, it will be readily seen that the parity of the
corrector will be the parity of the number of odd characteristics (parity one) required to form it.
In a lossless code, all even correctors, $2^{j-1}$ in number, shall be formed from all possible additions
of up to e characteristics in which the number of odd characteristics employed, 2s, is always even.
Let r be the total number of odd characteristics. We shall have the other condition sought:

$$\sum_{k=0}^{e}\ \sum_{0 \le 2s \le k} \frac{(n-r)!}{(n-r-k+2s)!\,(k-2s)!} \cdot \frac{r!}{(r-2s)!\,(2s)!} = 2^{j-1} \tag{2}$$

A corollary from (1) and (2) can be obtained by subtracting the second relation from the first, member
by member. This operation yields the relation:

$$\sum_{k=1}^{e}\ \sum_{1 \le 2s-1 \le k} \frac{(n-r)!}{(n-r-k+2s-1)!\,(k-2s+1)!} \cdot \frac{r!}{(r-2s+1)!\,(2s-1)!} = 2^{j-1} \tag{3}$$

When e = 2, (1) and (2) can be written:

$$n^2 + n + 2 = 2^{j+1} \tag{1a}$$

$$(n - r + 1)\, r = 2^{j-1} \tag{2a}$$

These relations are satisfied for n = 5 and r = 2 or 4 (r = 2 does not correspond to any code,
and r = 4 corresponds to the trivial case of a five-binit message, up to two of which can be in error),
but for larger values of n and r the approximation obtained by eliminating all but the highest degree
terms in the left members of (1a) and (2a),

$$n^2 \approx 2^{j+1} \tag{1b}$$

$$(n - r)\, r \approx 2^{j-1} \tag{2b}$$

indicates that $n \approx 2r$. Since both factors in the left member of (2a) must be powers of 2, and
they are then nearly equal, (2a) requires that:

$$r = 2^{(j-1)/2} \quad \text{and} \quad n + 1 = 2^{(j+1)/2}$$

and substitution of the value for n derived from the last relation in (1a) yields n = 1, which
contradicts the postulation of a large n.
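
The two-error argument can be checked numerically. The following is a minimal sketch in Python, not part of the original text; the function name `two_error_solutions` and the search limit are assumptions made for illustration. It looks for integers n, r, j with n at least 3 that satisfy $n^2 + n + 2 = 2^{j+1}$ and $(n - r + 1)\,r = 2^{j-1}$ simultaneously.

```python
def two_error_solutions(n_max):
    """Return all (n, r, j), 3 <= n <= n_max, satisfying both
    n^2 + n + 2 = 2^(j+1)   and   (n - r + 1) * r = 2^(j-1)."""
    found = []
    for n in range(3, n_max + 1):
        lhs = n * n + n + 2
        if lhs & (lhs - 1):            # lhs is not a power of two
            continue
        j = lhs.bit_length() - 2       # lhs == 2**(j + 1)
        for r in range(n + 1):
            if (n - r + 1) * r == 2 ** (j - 1):
                found.append((n, r, j))
    return found


if __name__ == "__main__":
    print(two_error_solutions(1000))
```

Within this range the only block length that survives both conditions is n = 5. The length n = 90 satisfies the first condition (j = 12) but admits no integer r in the second.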


In the general case where e > 2, it can be shown as follows that the search for a lossless code
need only be a finite one.

The highest power of n in (1) is in the term $n^e/e!$. It is, therefore, possible to rewrite (1) as
follows:

$$n^e (1 + \epsilon) = e!\, 2^j \tag{1c}$$

in which, for any given e, the quantity $\epsilon$ can be made arbitrarily small for large values of n.

The difference between the number of even and odd correctors should be zero in a lossless code, and
this condition can be expressed by the relation:

$$\sum_{k=0}^{e}\ \sum_{s=0}^{k} (-1)^s\, \frac{(n-r)!}{(n-r-k+s)!\,(k-s)!} \cdot \frac{r!}{(r-s)!\,s!} = 0 \tag{4}$$

which is obtained by subtracting (3) from (2), member by member.

The highest power terms in n and r in the expression above are:

$$\sum_{s=0}^{e} (-1)^s\, \frac{(n-r)^{e-s}}{(e-s)!} \cdot \frac{r^s}{s!} = \frac{(n-2r)^e}{e!} \tag{5}$$

all other terms being of the form $n^a r^b$ where a + b < e. It is seen thus that (4) will be satisfied when
n and r are related by an expression of the form:

$$n = 2r\,(1 + \eta) \tag{6}$$

in which, for any given e, $\eta$ can be made arbitrarily small by making n and r sufficiently large.


It will be noted now that r can be factored algebraically out of (3). The terms multiplied by
r, which are of the form

$$\frac{(r-1)!}{(r-2s+1)!\,(2s-1)!},$$

could be fractional, but each term will be an integer if multiplied by the highest common divisor
of r and 2s-1, h.c.d. (r, 2s-1). Therefore, if the lowest common multiple of all
h.c.d. (r, 2s-1)'s is factored out of r, l.c.m. (all h.c.d. (r, 2s-1)'s) = r', the products by
r' of all terms multiplied by r in the left member of (3) will be integers in all cases, and in order
to satisfy (3), it should be possible to write r in the form:

$$r = 2^a\, r'', \qquad r'' \le r' \tag{7}$$

It will be further noted that r', and hence r'' also, have the upper bound:

$$r',\ r'' \le \text{l.c.m. (all } (2s-1)\text{'s)}, \qquad 2s-1 \le e \tag{8}$$

Elimination of n and r between (1c), (6) and (7) gives:

$$2^{e(a+1)}\, (r'')^e\, (1 + \eta)^e\, (1 + \epsilon) = e!\, 2^j \tag{9}$$

For any given e, $\eta$ and $\epsilon$ approach zero for increasing j, while r'' has an upper bound. Therefore
an upper bound for j exists, beyond which (9) will not be satisfiable, because either e! will contain
odd prime factors not contained in the left member of (9), or the left member of (9) will contain a
number of odd prime factors which is a multiple of e, and which exceeds the number of the same odd
prime factors in e!.


While this demonstration indicates that the search for lossless binary codes correcting two or more
symbols need only be a limited one for any chosen number of errors, a search for such codes has
revealed, outside of the trivial cases of n errors in a (2n + 1)-binit message, only the case mentioned
earlier of a 3-error symbol correcting code for a 23-binit message. Whether, with the exception
of the trivial cases mentioned, this particular 3-error symbol correcting code is the only lossless
binary code correcting more than one symbol is a matter of speculation. The rarity of
the happenstance required for the satisfaction of both relations (1) and (2) suggests that it could
indeed be so, and offers the challenge of finding a mathematical demonstration of the impossibility
of satisfying (1) and (2) for any other case.


Message Correcting Codes


The demonstration above leaves open the question of whether there are lossless e-error message
correcting binary codes for any length of message, for only condition (1) applies to these, while
condition (2) does not, since it is predicated upon the existence of a lossless symbol correcting code.
For instance, the question is left open whether a 2-error correcting 90-binit message code exists,
since condition (1) is satisfied for this case.
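
Condition (1) alone is easy to test by machine. The sketch below is an illustration in Python, not part of the original text; the names `sphere_size` and `lossless_candidates` are assumptions introduced here. It lists the block lengths n above the trivial value 2e + 1 for which the number of ways of making up to e errors is an exact power of 2.

```python
from math import comb


def sphere_size(n: int, e: int) -> int:
    """Number of ways 0, 1, ..., e errors can occur in n binits: left member of (1)."""
    return sum(comb(n, k) for k in range(e + 1))


def lossless_candidates(e: int, n_max: int) -> list[tuple[int, int]]:
    """Pairs (n, j) with 2e + 1 < n <= n_max and sphere_size(n, e) == 2**j."""
    out = []
    for n in range(2 * e + 2, n_max + 1):
        s = sphere_size(n, e)
        if s & (s - 1) == 0:               # exact power of two
            out.append((n, s.bit_length() - 1))
    return out


if __name__ == "__main__":
    print(lossless_candidates(2, 200))     # candidates for 2-error correction
    print(lossless_candidates(3, 200))     # candidates for 3-error correction
```

For e = 2 the 90-binit case appears, since 1 + 90 + 4005 = 4096; for e = 3 the 23-binit case appears, since 1 + 23 + 253 + 1771 = 2048. Passing this test is necessary for a lossless code but does not show that one exists.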


The possibility that lossless message correcting codes exist where lossless symbol correcting
codes do not is thus predicated upon the circumstance that message correcting codes form a more general
class of codes. While no lossless binary message correcting codes are known for which there are no
corresponding lossless symbol correcting codes, examples will be given below of lossy message
correcting codes which are more efficient than the available symbol correcting codes for the same number
of message symbols and maximum allowable number of errors.


Some of these examples will be derived from the $a_{mk}$ matrix already published in the referenced
Letter to the Editor, and the formation of the top ten symbols in the $Y_2$ to $Y_{12}$ columns will be
explained briefly first.


If we consider five straight lines in a plane, A, B, C, D and E, and order their respective
intersections as follows: AB, AC, AD, AE, BC, BD, BE, CD, CE, DE, then $Y_2$ is formed by associating a 0
with the four intersections represented by the products in the expression:


A(B + C + D + E)


and a 1 with all other positions. $Y_3$, $Y_4$, $Y_5$ and $Y_6$ are formed likewise by associating a 0 with the
intersections of B, C, D and E respectively with all four remaining lines.


$Y_7$ is formed by associating a 0 with the 5 intersections AB, BE, ED, DC, and CA of neighboring
lines (including the first and last) in the operator:


(ABEDC)

which will be designated to represent the ensemble of 5 intersections listed above.


There are 4! cyclical permutations of the five lines, which can be separated into two groups of 12,
the members of any one group being derivable from the other 11 members by an even number of
interchanges of elements, so that they can be said to be of the same parity. Within each group of 12 there
are 6 pairs of permutations which differ only by their order, so that both members of each pair
determine the same ensemble of 5 intersections. Thus, there will be only 6 distinct ensembles of 5
intersections having the same parity, and those belonging to the parity of the operator written above for
$Y_7$ will determine the 0's of the upper 10 places of $Y_7$ to $Y_{12}$.


It can be verified by inspection that the upper 10 symbols of $Y_2$ to $Y_{12}$, as well as the Boolean
additions of any two of these ensembles, differ from all others in at least three places. Thus we
can form the ensemble of 66 10-symbol messages written in Table I, which, together with the all 0's
and all 1's messages, form 68 10-symbol messages which differ in at least three places, and are
therefore 1-error correcting messages.


TABLE I 


001110001010001110:101110010110101010:101110010101100110100111010111
011101100001010010:010101001110011011:100010100110111101111101011001
010101010010101001:110010110010011101:011101100110001110101010111101
010010100110010101:101101100101011100:110101011010100101011011001111
100011101000100011:011011011001001110:110100001101011011101101101110
101011010100011000:100100011001110111:010011110001010111111011110010
100100011001010101:110011100101100011:101011011010011010011100110111
110000000101101110:001111101010100101:001111101001101001110110101011
111111111111111111:000000000000000000:111111111111111111000000000000
111111111111111111:111111111111111111:000000000000000000000000000000


Likewise, the two smaller blocks of 36 9-symbol messages and 18 8-symbol messages, shown within
the dashed enclosures, indicate that, together with the all 0's and all 1's messages, there are 38
1-error correcting 9-symbol messages, and 20 1-error correcting 8-symbol messages.
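
The 10-symbol construction can be reproduced mechanically. The following Python sketch is an illustration, not the author's procedure: it builds the five single-line ensembles and the six five-cycle ensembles of one parity from the intersections of five lines, forms all pairwise Boolean sums, and measures the smallest separation. The helper names (`message`, `is_even`, `min_distance`) are chosen here, and the bit ordering need not match the column order of Table I.

```python
from itertools import combinations, permutations

LINES = "ABCDE"
EDGES = [frozenset(p) for p in combinations(LINES, 2)]   # AB, AC, ..., DE


def message(zero_edges):
    """Bit mask with 0 at the listed intersections and 1 at all others."""
    return sum(1 << i for i, e in enumerate(EDGES) if e not in zero_edges)


def is_even(perm):
    """True when the permutation of range(5) has an even number of inversions."""
    inv = sum(perm[i] > perm[j] for i in range(5) for j in range(i + 1, 5))
    return inv % 2 == 0


def min_distance(words):
    """Smallest number of differing places between two distinct messages."""
    return min(bin(a ^ b).count("1") for a, b in combinations(words, 2))


# Five ensembles of the type A(B + C + D + E): one line met by the other four.
stars = [{e for e in EDGES if line in e} for line in LINES]

# Six ensembles of the type (ABEDC): images of the cycle under even permutations.
base = "ABEDC"
pentagons = set()
for perm in permutations(range(5)):
    if is_even(perm):
        cyc = [LINES[perm[LINES.index(c)]] for c in base]
        pentagons.add(frozenset(frozenset((cyc[i], cyc[(i + 1) % 5]))
                                for i in range(5)))

generators = [message(z) for z in stars] + [message(z) for z in pentagons]
ALL_ONES = (1 << 10) - 1

sums = [a ^ b for a, b in combinations(generators, 2)]
code68 = set(generators) | set(sums) | {0, ALL_ONES}     # 11 + 55 + 2 messages
code12 = set(generators) | {0}                           # 11 + 1 messages

if __name__ == "__main__":
    print(len(code68), min_distance(code68))
    print(len(code12), min_distance(code12))
```

Under these assumptions the 68 messages are pairwise at least three places apart, and the eleven ensembles together with the all 0's message are pairwise at least five places apart.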


On the other hand, it can be easily verified that there are only 16, 32 and 64 messages possible
on the basis of 1-error correcting symbol codes for 8, 9 and 10-symbol messages respectively, because
the number of cases of zero or one error is 9, 10 and 11 respectively in these three cases, which
requires the assignment of 4 redundant symbols to the removal of the equivocation, thus leaving only
4, 5 and 6 symbols respectively for the message transmission.
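
The counting argument for symbol codes amounts to a short computation, sketched below in Python as an illustration. The function name `symbol_code_messages` is an assumption; it returns only the least redundancy compatible with the number of error cases, which is a necessary condition and not a proof that such a code exists.

```python
from math import comb


def symbol_code_messages(n: int, e: int) -> tuple[int, int]:
    """Least number c of redundant symbols with 2**c >= number of error cases,
    and the 2**(n - c) messages that the remaining symbols can carry."""
    cases = sum(comb(n, k) for k in range(e + 1))   # 0, 1, ..., e errors
    c = (cases - 1).bit_length()                    # smallest c with 2**c >= cases
    return c, 2 ** (n - c)


if __name__ == "__main__":
    for n in (8, 9, 10):
        print(n, symbol_code_messages(n, 1))
```

For single-error correction this gives 4 redundant symbols in each of the three cases, hence 16, 32 and 64 messages.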


It will also be noted that the upper 10 places of all $Y_2$ to $Y_{12}$, plus the all 0's message, form
an ensemble of 12 10-symbol messages each of which differs from all others by at least five symbols,
and which are, therefore, 2-error correcting messages. The number of ways in which 0, 1 or 2 errors can
occur in 10 places is 1 + 10 + 45 = 56, which indicates that a minimum of 6 redundant symbols should
be assigned to the removal of the equivocation, thus leaving at most 4 symbols for the message
transmission. However, it can be verified by inspection that it is impossible to form 4 6-symbol
characteristics which, together with the 6 correctors for redundant X's, constitute an ensemble in which
any member and any sum of two members differ from all other single members or sums of two.


On the other hand, it is possible to assign 7 symbols to the removal of the equivocation, and
to have 3 7-symbol characteristics satisfying the conditions required, so that only 3 symbols become
available for the transmission of only 8 possible messages. Thus, here again, a larger number of
messages can be transmitted by message coding than by symbol coding.


When the formation of 2-error correcting message codes is extended to 15-binit messages, in which
the 15 intersections of 6 straight lines are associated in various ways with the message symbols, more
care is required for the selection of favorable symmetries.


Thus, we may associate the five intersections:


A(B + C + D + E + F)


with five 0's, and the 6 groups of intersections of any one of the six lines with all others give us
6 messages sufficiently distant from each other for 2-error correction.


We may consider next the 15 groups of intersections given by the various products of the form

(A + B)(C + D + E + F)


and we can verify that these differ in at least 5 symbol positions from each other and from those of the
preceding form.


The 10 groups of intersections determined by expressions of the form

(A + B + C)(D + E + F)


can be added likewise to the other groups while satisfying the required criterion of a minimum of
5-symbol separations for 2-error correcting messages.
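
The first three families of 15-symbol messages can be checked together. The Python sketch below is an illustration under the assumption that each product denotes the set of intersections of a line in the first factor with a line in the second; the names `product_message` and `min_distance` are chosen here.

```python
from itertools import combinations

LINES = "ABCDEF"
EDGES = list(combinations(LINES, 2))        # the 15 intersections of 6 lines


def product_message(left):
    """Bit mask with 0 where a line of `left` meets a line outside `left`,
    and 1 at every other intersection."""
    left = set(left)
    return sum(1 << i for i, (x, y) in enumerate(EDGES)
               if (x in left) == (y in left))


def min_distance(words):
    """Smallest number of differing places between two distinct messages."""
    return min(bin(a ^ b).count("1") for a, b in combinations(words, 2))


family1 = {product_message(s) for s in combinations(LINES, 1)}   # A(B+C+D+E+F)
family2 = {product_message(s) for s in combinations(LINES, 2)}   # (A+B)(C+D+E+F)
family3 = {product_message(s) for s in combinations(LINES, 3)}   # (A+B+C)(D+E+F)
all_three = family1 | family2 | family3

if __name__ == "__main__":
    print(len(family1), len(family2), len(family3), min_distance(all_three))
```

The three families contain 6, 15 and 10 messages, and any two of the 31 differ in at least 5 places.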


The 6 groups of 5 0's represented by the operator

(ABCDE)


and the five other operators of the same parity derivable from it can be verified to represent messages
sufficiently distant from all others to permit 2-error correction. The letter F can be substituted for
any and all other letters, provided any other two letters are interchanged whenever a substitution is
made, to provide more messages satisfying the 5-symbol distance criterion. Thus, 36 messages of this
last type can be formed.


The total of messages satisfying the 5-symbol distance criterion which can be formed as indicated
above is therefore:


6 + 15 + 10 + 36 = 67

It can be verified further that the 67 new messages formed by the Boolean addition of the all 1's
message to these satisfy the criterion with the 67 old messages. Adding the all 0's and all 1's
messages gives us the total of 136 messages for the case of 2-error correcting 15-symbol messages.


Up to 2 errors can occur in a 15-symbol message in 1 + 15 + 105 = 121 ways, and the upper bound
to the number of theoretically possible messages is therefore $2^{15}/121$, or about 270. The number of
possible messages found above, 136, is seen to be slightly over half the number given by that upper
bound. With a symbol correcting code, 128 messages, i.e. slightly less than half the upper bound
stated, could be transmitted by means of the $a_{mk}$ matrix given in Table II.


TABLE II


Matrix for 2-error Symbol Correction of 15-symbol Messages


[The entries of the matrix are illegible in the source.]


The examples of message coding given above suggest the question of whether the procedures de- 
scribed could be made methodical and be extended to longer messages. This question cannot be 
answered at this stage; instead circumstances will be pointed out which make such an answer difficult. 


In the case just examined of a 15-symbol code in which the symbol positions were associated with
the 15 intersections of 6 straight lines in a plane, only a restricted number of line groupings were
studied. For instance, messages in which the 0's or 1's are given by the intersections of elements
not above each other in the two lines of the matrix AEG constitute another symmetrical grouping, which
examination indicated not to be useful in building 2-error correcting 15-symbol message codes,
but which could be useful in other codes. Thus, an as yet unsystematized selection of favorable groups
must be made.


Codes may be based on the restricted class of n(2n - 1)-symbol messages (n ≥ 3) which can be formed
by assigning the symbols 0 or 1 to the positions determined by the intersections of 2n lines,
$A_1, A_2, \ldots, A_{2n}$, given by all expressions of the form:

$$(A_1 + A_2 + \cdots + A_{2k})(A_{2k+1} + A_{2k+2} + \cdots + A_{2n})$$

and by assigning 1 versus 0 to all other points.


Together with the all 0's and all 1's messages, these number $2^{2n-1}$ and are (2n - 3)-error correcting.
Thus, 15, 28 and 45-symbol messages will be $2^5$, $2^7$ and $2^9$ in number and will be 3, 5 and 7-error
correcting respectively. This code equals the Reed code in the case of 15-symbol messages, and is
inferior to it for longer messages. A short examination of codes formed by considering the
intersections of the planes $A_1, A_2, \ldots, A_n$ in a 3-dimensional space which are of the form:

$$(A_1 + \cdots + A_i)(A_{i+1} + \cdots + A_j)(A_{j+1} + \cdots + A_n)$$

has not indicated that an extension of this attack to multidimensional spaces is promising. There
again, a selection of proper symmetries is required.
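
The counts quoted for this systematic class can be reproduced under one reading of the construction, which is an assumption made here because the printed expression is hard to read: the 2n lines are split into two groups each containing an even number of lines, the marked positions are the intersections of a line of one group with a line of the other, and both assignments (0 on the marked positions or 1 on them) are admitted. The Python sketch below, with names chosen for illustration, builds that class for n = 3, 4 and 5.

```python
from itertools import combinations


def even_split_code(n):
    """Messages on the n(2n-1) intersections of 2n lines obtained from all
    splits of the lines into two groups of even size, with either symbol
    assigned to the intersections between the groups. Returns (words, length)."""
    lines = range(2 * n)
    edges = list(combinations(lines, 2))
    all_ones = (1 << len(edges)) - 1
    words = set()
    for size in range(0, 2 * n + 1, 2):
        for left in combinations(lines, size):
            left = set(left)
            w = sum(1 << i for i, (x, y) in enumerate(edges)
                    if (x in left) != (y in left))
            words.add(w)                 # 1 on the marked intersections
            words.add(w ^ all_ones)      # 0 on the marked intersections
    return words, len(edges)


def min_distance(words):
    """Smallest number of differing places between two distinct messages."""
    return min(bin(a ^ b).count("1") for a, b in combinations(words, 2))


if __name__ == "__main__":
    for n in (3, 4, 5):
        words, length = even_split_code(n)
        print(length, len(words), min_distance(words))
```

Under this reading the 15, 28 and 45-symbol classes contain 32, 128 and 512 messages with minimum separations of 7, 12 and 16 places, which is enough for 3, 5 and 7-error correction respectively.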


Another circumstance to be pointed out here is the completely symmetrical part played by all
straight lines in the formation of the 1 or 2-error correcting 10-symbol messages and 2-error
correcting 15-symbol messages described above. By contrast, the lossless 1-error symbol
correcting 15-symbol message code can be seen to be expressible in terms of the intersections of 6
lines in which 4 lines play symmetrical roles, but the other 2 do not. This may permit the speculation
that an approach to the problem of building a 2-error correcting 90-symbol message code of $2^{78}$
messages may be to consider the 90 intersections of 14 lines in which 2 lines, the intersection of
which is not counted, play a part not symmetrical with that played by the 12 others.


CONCLUSION 


It has been shown that lossless symbol correcting message codes can exist only for message
lengths which have an upper bound, and it can be speculated whether any exist outside of the cases
of 1-error correcting ($2^n - 1$)-symbol messages, n-error correcting (2n + 1)-symbol messages, and 3-error
correcting 23-symbol messages.


It has also been shown by examples that the more general class of message correcting codes
permits a higher coding efficiency than symbol correcting codes, and the existence of lossless message
correcting codes not included within the lossless symbol correcting codes mentioned above appears
less improbable.


While the only systematic message correcting code described in the text is less efficient than
the Reed code, it is suggested that an attack along these lines may prove more fruitful than if
restricted to the sub-class of symbol correcting codes, when attempts are made to design systematic
codes approaching Shannon's upper bound.




ERROR-FREE CODING*


Peter Elias 


Department of Electrical Engineering and Research Laboratory of Electronics 
Massachusetts Institute of Technology 
Cambridge, Massachusetts 


Introduction 


This paper describes constructive procedures for encoding messages to be sent over noisy channels
so that they may be decoded with an arbitrarily low error rate. The procedures are a kind of iteration
of simple error-correcting codes such as those of Hamming¹ and Golay²; any additional systematic
codes which may be discovered, such as those discussed by Reed³ and Muller⁴, may be iterated in the
same way.

The procedures are not ideal; that is, the capacity of a noisy channel for the transmission of error- 
free information using such coding is smaller than information theory says it should be. However, the 
procedures do permit the transmission of error-free information at a positive rate. They also have 
these two properties. 

(1) The codes are "systematic" in Hamming's sense: they are what Golay calls "digit codes" rather 
than "message codes." That is, the transmitted symbols are divided into so-called "information digits" 
and "check digits." The customer who has a message to send supplies the information digits which are 
transmitted unchanged. Periodically the coder at the transmitter computes some check digits, which 
are functions of past information digits, and transmits them. The customer with a short message does 
not have to wait for a long block of symbols to accumulate before coding can proceed, as in the case of 
codebook coding, nor does the coder need a codebook memory containing all possible symbol 
sequences. The coder needs only a memory of the past information digits it has transmitted and a quite 
simple computer. 

(2) The error probability of the received messages is as low as the receiver cares to make it. If
the coding process has been properly selected for a given noisy channel, the customer at the receiver 
can set the probability of error per decoded symbol (or the probability of error for the entire sequence 
of decoded symbols transmitted up to the present, or the equivocation of part or all of the decoded sym- 
bol sequence) at as low a value as he chooses. It will cost him more delay to get a more reliable 
message, but it will not be necessary to alter the coding and decoding procedure when he raises his 
standards, nor will it be necessary for less particular and more impatient customers using the same 
channel to put up with the additional delay. This is again unlike codebook processes, in which the code- 
book must be rewritten for all customers if any one of them raises his standards. 

Perhaps the simplest way to indicate the basic behavior of such codes is to describe how one would
work in a commercial telegraph system. A customer entering the telegraph office presents a sequence
of symbols which are sent out immediately over a noisy channel to another office, which immediately
reproduces the sequence, adds a note "the probability of error per symbol is $10^{-1}$, but wait till
tomorrow," and sends it off to the recipient. Next day the recipient receives a note saying "For 'sex'
read 'six'. The probability of error per symbol is now $10^{-2}$, but wait till next week." A week later the
recipient gets another note: "For 'lather' read 'gather'. The probability of error per symbol is now
$10^{-4}$, but wait till next April." This flow of notes continues, the error probability dropping rapidly from
note to note, until the recipient gets tired of the whole business and tells the telegraph company to stop
bothering him.

Since these coding procedures are derived by an iteration of simple error-correcting and detecting
codes, their performance depends on what kind of code is iterated. For a binary channel with a small
and symmetric error probability, the best choice among the available procedures is the Hamming-Golay
single-error-correction double-error-detection code developed by Hamming¹ for the binary case and
extended by Golay² to the case of symbols selected from an alphabet of M different symbols, where M
is any prime number. The analysis of the binary case will be presented in some detail and will be
followed by some notes on diverse modifications and generalizations.


*This work was supported in part by the Signal Corps; the Office of Scientific Research, Air 
Research and Development Command; and the Office of Naval Research. 




Iterated Hamming Codes 


First-Order Check 


Consider a noisy binary channel, which transmits each second either a zero or a one, with a
probability $(1 - p_0)$ that the symbol will be received as transmitted, and a probability $p_0$ that it will be
received in error. Error probabilities for successive symbols are assumed to be statistically
independent.


Let the receiver divide the received symbol sequence into consecutive blocks, each block consisting
of $N_1$ consecutive symbols. Because of the assumed independence of successive transmission errors,
the error distribution in the blocks will be binomial: there will be a probability

$$P(0) = (1 - p_0)^{N_1}$$

that no errors have occurred in a block, and a probability P(i)

$$P(i) = \frac{N_1!}{i!\,(N_1 - i)!}\; p_0^{\,i}\, (1 - p_0)^{N_1 - i} \tag{1}$$

that exactly i errors have occurred.

If the expected number of errors per received block, $N_1 p_0$, is small, then the use of a Hamming
error-correction code will produce an average number of errors per block, $N_1 p_1$ after error
correction, which is smaller still. Thus $p_1$, the average probability of error per position after error
correction, will be less than $p_0$. An exact computation of the extent of this reduction is complicated,
but some inequalities are easily obtained.
The single-error-correction check digits of the Hamming code give the location of any single error
within the block of $N_1$ digits, permitting it to be corrected. If more errors have occurred, they give a
location which is usually not that of an incorrect digit, so that altering the digit in that location will

usually cause one new error, and cannot cause more than one. The double-error-detection check digit 
tells the receiver whether an even or an odd number of errors has occurred. If an even number has 

occurred and an error location is indicated, the receiver does not make the indicated correction, and 

thus avoids what is very probably the addition of a new error. 

The single-correction double-detection code, therefore, will leave error-free blocks alone, will 
correct single errors, will not alter the number of errors when it is even, and may increase the number 
by at most one when it is odd and greater than one. This gives for the expected number of errors per 
block after checking 


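$$N_1 p_1 \le \sum_{\text{even } i \ge 2}^{N_1} i\,P(i) + \sum_{\text{odd } i \ge 3}^{N_1} (i+1)\,P(i)$$

$$\le 2P(2) + \sum_{i=3}^{N_1} (i+1)\,P(i)$$

$$\le \sum_{i=0}^{N_1} (i+1)\,P(i) - P(0) - 2P(1) - P(2)$$

$$\le 1 + N_1 p_0 - P(0) - 2P(1) - P(2). \qquad (2)$$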


Substituting the binomial error probabilities from (1), expanding and collecting terms, gives, for $N_1 p_0 \le 3$,

$$N_1 p_1 \le N_1 (N_1 - 1)\, p_0^2,$$

$$p_1 \le (N_1 - 1)\, p_0^2 < N_1 p_0^2. \qquad (3)$$


The error probability per position can therefore be reduced by making $N_1$ sufficiently small. The shortest code of this type requires $N_1 = 4$, and the inequality (3) suggests that a reduction will therefore not be possible if $p_0 \ge 1/3$. The fault is in the equation, however, and not the code: for $N_1 = 4$ it is a simple majority-rule code which will always produce an improvement for any $p_0 < 1/2$.
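The chain of inequalities (2) and (3) can be checked numerically. The following is a minimal sketch, not part of the original paper; the function names are illustrative, and the first function assumes the worst case described above (even error counts unchanged, odd counts of three or more increased by one).

```python
from math import comb

def block_error_pmf(n1, p0):
    """P(i) of equation (1): probability of exactly i errors in a block of n1 digits."""
    return [comb(n1, i) * p0**i * (1 - p0) ** (n1 - i) for i in range(n1 + 1)]

def expected_errors_after_check(n1, p0):
    """First line of (2): even counts stay, odd counts >= 3 grow by at most one."""
    pmf = block_error_pmf(n1, p0)
    return sum((i if i % 2 == 0 else i + 1) * pmf[i] for i in range(2, n1 + 1))

def bound_2(n1, p0):
    """Last line of (2): 1 + N1*p0 - P(0) - 2P(1) - P(2)."""
    pmf = block_error_pmf(n1, p0)
    return 1 + n1 * p0 - pmf[0] - 2 * pmf[1] - pmf[2]

def bound_3(n1, p0):
    """Bound (3) on the expected number of errors per block: N1 (N1 - 1) p0^2."""
    return n1 * (n1 - 1) * p0**2

for n1, p0 in [(8, 0.01), (16, 0.1), (4, 0.3)]:
    print(n1, p0, expected_errors_after_check(n1, p0), bound_2(n1, p0), bound_3(n1, p0))
```

For each pair the three printed quantities should appear in nondecreasing order, as (2) and (3) require when $N_1 p_0 \le 3$.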

A Hamming single-correction double-detection code uses C of the N positions in a block for 
checking purposes and the remaining N - C positions for the customer's symbols, where 


$$C = [\log_2 (N - 1) + 2]. \qquad (4)$$


(Here and later, square brackets around a number denote the largest integer which is less than or 
equal to the number enclosed. Logarithms will be taken to the base 2 unless otherwise specified. ) 
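As an illustration of (4), the sketch below evaluates the number of check digits for a few block lengths. The function name is an assumption made for this example; integer arithmetic is used so that the bracket (integer part) is exact.

```python
def check_digits(n):
    """C = [log2(n - 1) + 2] from equation (4), computed with exact integer arithmetic."""
    if n < 2:
        raise ValueError("block length must be at least 2")
    # For n - 1 >= 1, the integer part of log2(n - 1) is (n - 1).bit_length() - 1.
    return (n - 1).bit_length() - 1 + 2

for n in (4, 8, 16, 32):
    c = check_digits(n)
    print(f"N = {n:2d}: C = {c}, information digits = {n - c}, fraction = {(n - c) / n:.3f}")
```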

Higher Order Checks

After completing the first-order check, the receiver discards the $C_1$ check digits, leaving only the $N_1 - C_1$ checked information digits, with the reduced error probability $p_1$ per position. (It can be shown that the error probability after checking is the same for all $N_1$ positions in the block, so that discarding the check digits does not alter the error probability per position for the information digits.) Now some of these checked digits are made use of for further checking, again with a Hamming code. The receiver divides the checked digits into blocks of $N_2$; the $C_2$ checked check digits in each block enable it, again, to correct any single error in the block, although multiple errors may be increased by one in number. In order for the checking to reduce the expected number of errors per second-order block, however, it is necessary to select the locations of the $N_2$ symbols in the block with some care.


The simplest choice would be to take several consecutive first-order blocks of $N_1 - C_1$ adjacent checked information digits as a second-order block, but this is guaranteed not to work. For if there are any errors at all left in this group of digits after the first-order checking, there are certainly two or more, and the second-order check cannot correct them. In order for the error probability per place after the second-order check to satisfy the analog of (3), namely,

$$p_2 \le (N_2 - 1)\, p_1^2 < N_2\, p_1^2, \qquad (5)$$

it is necessary for the $N_2$ positions included in the second-order check to have statistically independent errors after the first check has been completed. This will be true if, and only if, each position was in a different block of $N_1$ adjacent symbols for the first-order check.


The simplest way to guarantee this independence is to put each group of $N_1 \times N_2$ successive symbols in a rectangular array, checking each row of $N_1$ symbols by means of $C_1$ check digits, and then checking each column of already checked symbols by means of $C_2$ check digits. The procedure is illustrated in Fig. 1. The transmitter sends the $N_1 - C_1$ information digits in the first row, computes the $C_1$ check digits and sends them, and proceeds to the next row. This process continues down through row $N_2 - C_2$. Then the transmitter computes the $C_2$ check digits for each column and writes them down in the last $C_2$ rows. It transmits one row at a time, using the first $N_1 - C_1$ of the positions in that row for the second-order check, and the last $C_1$ digits in the row for a first-order check of the second-order check digits.
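The row-then-column procedure can be sketched for the smallest convenient case, $N_1 = N_2 = 8$ and $C_1 = C_2 = 4$. The code below is an illustrative assumption, not the paper's own construction: it uses one particular systematic form of the extended Hamming code (information digits first, check digits last), and the names `encode8`, `is_codeword` and `encode_rectangle` are hypothetical.

```python
def encode8(d):
    """Systematic single-correction, double-detection word: 4 information bits, 4 checks."""
    d1, d2, d3, d4 = d
    c1 = d1 ^ d2 ^ d4
    c2 = d1 ^ d3 ^ d4
    c3 = d2 ^ d3 ^ d4
    c4 = d1 ^ d2 ^ d3 ^ d4 ^ c1 ^ c2 ^ c3  # overall parity digit (double-error detection)
    return [d1, d2, d3, d4, c1, c2, c3, c4]

def is_codeword(w):
    """A word is valid when its last 4 digits equal the checks of its first 4 digits."""
    return encode8(w[:4]) == list(w)

def encode_rectangle(info):
    """info: 4 rows of 4 information bits -> 8 x 8 array of transmitted digits."""
    rows = [encode8(r) for r in info]  # first-order checks, one row at a time
    # Second-order checks: every column of the 4 x 8 array is extended to 8 digits.
    cols = [encode8([rows[i][j] for i in range(4)]) for j in range(8)]
    return [[cols[j][i] for j in range(8)] for i in range(8)]

array = encode_rectangle([[1, 0, 1, 1], [0, 1, 1, 0], [1, 1, 0, 0], [0, 0, 0, 1]])
for row in array:
    print(" ".join(map(str, row)))
```

Because the check digits are linear (modulo-two) functions of the information digits, the last four rows come out as valid first-order code words as well, which is the property the text relies on when it speaks of a first-order check of the second-order check digits.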
After the second-order check, then, the inequality (5) applies as before, and we have for $p_2$, the probability of error per position,

$$p_2 < N_2\, p_1^2 < N_2 N_1^2\, p_0^4. \qquad (6)$$

The $N_3$ digits to be checked by the third-order check may be taken from corresponding positions in each of $N_3$ different $N_1 \times N_2$ rectangles, the $N_4$ digits in a fourth-order block from corresponding positions in $N_4$ such collections of $N_1 \times N_2 \times N_3$ symbols each, and so on ad infinitum. At the $k$th stage this gives

$$p_k < N_k\, N_{k-1}^{2}\, N_{k-2}^{2^2} \cdots N_1^{2^{k-1}}\, p_0^{2^k}. \qquad (7)$$


It is now necessary to show that not all of the channel is occupied, in the limit, with checking digits of one order or another, so that some information can also get through. The fraction of symbols used for information at the first stage is $(1 - C_1/N_1)$. At the $k$th stage, it is

$$F_k = \prod_{j=1}^{k} \left(1 - \frac{C_j}{N_j}\right). \qquad (8)$$

It is now necessary to find a sequence of $N_j$ for which $p_k$ approaches zero and $F_k$ does not, as $k$ increases without bound. A convenient sequence is

$$N_1 = 2^n, \qquad N_j = 2^{j-1} N_1 = 2^{j+n-1}. \qquad (9)$$

This gives for $p_k$, from (7),

$$p_k < \frac{1}{N_1}\,(2 N_1 p_0)^{2^k}\, 2^{-(k+1)}. \qquad (10)$$


The right side of (10) approaches zero as $k$ increases, for any $N_1 p_0 \le 1/2$. Thus the error probability can be made to vanish in the limit. Note that the inequality gives a much weaker kind of approach to zero for the threshold value $N_1 p_0 = 1/2$ than for any smaller value of errors per first-order block.
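The closed form (10) follows from applying $p_j < N_j\, p_{j-1}^2$ repeatedly with the block lengths (9). A short sketch, with illustrative function names, confirms that the recursion and the closed form agree:

```python
from math import isclose

def bound_recursive(n1, p0, k):
    """Iterate p_j = N_j * p_{j-1}^2 with N_j = 2^(j-1) * N_1, as in (7) and (9)."""
    p = p0
    for j in range(1, k + 1):
        nj = n1 * 2 ** (j - 1)
        p = nj * p * p
    return p

def bound_closed(n1, p0, k):
    """Right side of (10): (1/N_1) (2 N_1 p_0)^(2^k) 2^-(k+1)."""
    return (2 * n1 * p0) ** (2**k) * 2.0 ** (-(k + 1)) / n1

for k in range(1, 6):
    r, c = bound_recursive(4, 0.05, k), bound_closed(4, 0.05, k)
    print(k, r, c, isclose(r, c, rel_tol=1e-9))
```

With $N_1 = 4$ and $p_0 = 0.05$ (so $N_1 p_0 = 0.2$, below threshold) the bound falls from $10^{-2}$ at $k = 1$ to below $10^{-15}$ at $k = 5$.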


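For the same sequence of $N_j$ a lower bound on $F_\infty$ can be computed. From equations (8) and (4) we have

$$F_\infty = \prod_{j=1}^{\infty}\left(1 - \frac{C_j}{N_j}\right) = \prod_{j=1}^{\infty}\left(1 - \frac{\log N_j + 1}{N_j}\right) = \prod_{j=1}^{\infty}\left(1 - \frac{j+n}{2^{j+n-1}}\right). \qquad (11)$$

Let

$$\sigma_j = \frac{C_j}{N_j}, \qquad \sigma = \sum_{j=1}^{\infty} \sigma_j. \qquad (12)$$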


Then $\sigma_j$ is monotonic decreasing in $j$ and is less than 1 for all constructable Hamming codes, that is, for $N_1 = 2^n \ge 4$. This makes it possible to write the following inequalities:


a/o 
a l1-«o,) YS iGxer 
> ea v : (13) 


Here the last term on the right is one of the Weierstrass inequalities for an infinite product; the other terms are useful when $\sigma > 1$, and show that for $\sigma_1 < 1$ and $\sigma < \infty$, $F_\infty$ is strictly positive.


Evaluating $\sigma$ in the present case gives

$$\sigma = \sum_{j=1}^{\infty} \frac{j+n}{2^{j+n-1}} = \frac{n+2}{2^{n-1}} = \frac{2 \log 4N_1}{N_1}. \qquad (14)$$


At threshold, that is, at $N_1 p_0 = 1/2$, this gives

$$\sigma = 4 p_0 \log \frac{2}{p_0} < 4\left[p_0 \log \frac{1}{p_0} + (1 - p_0) \log \frac{1}{1 - p_0}\right] = 4E, \qquad (15)$$

where $E$ is the equivocation of the noisy channel. Thus for $p_0$ small, from (13) we have

$$F_\infty \ge 1 - 4E. \qquad (16)$$

That is, under the specified conditions ($N_1 p_0 = 1/2$, $N_1 = 2^n \ge 4$) the number of check digits required is never more than four times the number that would be required for an ideal code, provided that an ideal code of the check-digit type exists, which is not obvious. When $E$ is $> 1/4$, the interior inequality shows that $F_\infty$ is still positive.
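The sum (14) and the product (11) are easy to evaluate numerically. The sketch below truncates both after a fixed number of terms (an assumption that is harmless here, since the terms decay geometrically) and compares the information fraction with the Weierstrass bound $1 - \sigma$; the function names are illustrative.

```python
from math import log2

def sigma(n, terms=200):
    """Truncated sum of C_j / N_j = (j + n) / 2^(j + n - 1), as in (14)."""
    return sum((j + n) / 2 ** (j + n - 1) for j in range(1, terms + 1))

def information_fraction(n, terms=200):
    """Truncated product (11) for the limiting fraction of information digits."""
    f = 1.0
    for j in range(1, terms + 1):
        f *= 1 - (j + n) / 2 ** (j + n - 1)
    return f

for n in (2, 3, 4, 6, 8):
    n1 = 2**n
    print(n1, sigma(n), 2 * log2(4 * n1) / n1, information_fraction(n))
```

For $N_1 = 16$ the sum is $0.75$, so the bound $1 - \sigma$ gives $0.25$, while the product itself is noticeably larger. For $N_1 = 4$ the sum exceeds 1, the simple bound is useless, and the product is nevertheless positive.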


Equivocation 


Feinstein (3) has shown that it is possible to find ideal codes for which not only the probability of error,
but the total equivocation, vanishes in the limit as longer and longer symbol sequences are used. This 
property is also true for the coding processes described here. This is a very important result in the 
case of codebook codes, where the message becomes infinite in the limit. For the codes under discus- 
sion here, it is a less important property, since any finite message can be received without an infinite 
lag, and its equivocation vanishes with the error probability per position. 


The total number of binary digits checked by the $k$th checking stage is

$$M_k = \prod_{j=1}^{k} N_j. \qquad (17)$$

Of these $F_k M_k$ are information digits and the remainder are checks. Using the values (9) for the $N_j$ we have

$$M_k = N_1^{\,k}\, 2^{k(k-1)/2}. \qquad (18)$$


The bound (10) limits the probability of error per position. Multiplying this by $M_k$ gives a bound on the mean number of errors per $M_k$ digits, which is also a bound on the fraction of sequences of $M_k$ digits which are in error after checking. This is a gross bound, since actually any such sequence which is in error must have many errors, and not just one. Thus for $Q_k$, the probability that a checked group of $M_k$ digits is in error, we have

$$Q_k \le M_k\, p_k < N_1^{\,k-1}\,(2 N_1 p_0)^{2^k}\, 2^{\frac{k(k-1)}{2} - (k+1)}. \qquad (19)$$

At threshold ($N_1 p_0 = 1/2$) this inequality does not guarantee convergence, but for $N_1 p_0 < 1/2$, $Q_k$ certainly approaches zero as $k$ increases.
The equivocation $E_k$ per sequence of $M_k$ terms is bounded by the value it would have if any error in a block made all possible symbol sequences equally likely at the receiver, that is,

$$E_k \le Q_k \log \frac{1}{Q_k} + (1 - Q_k) \log \frac{1}{1 - Q_k} + Q_k M_k. \qquad (20)$$

Again at threshold convergence is not guaranteed, but for $N_1 p_0 < 1/2$, $E_k$, the absolute equivocation of the block, will also vanish as $k$ increases.
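A numerical sketch of the block-error and equivocation bounds may make the convergence concrete. It assumes the block lengths (9), the per-position bound (10), $Q_k \le M_k p_k$, and an equivocation bound of the form (20); the function names and the test values $N_1 = 4$, $p_0 = 0.05$ are illustrative.

```python
from math import log2

def p_bound(n1, p0, k):
    """Per-position error bound (10)."""
    return (2 * n1 * p0) ** (2**k) * 2.0 ** (-(k + 1)) / n1

def m_k(n1, k):
    """Total digits checked at stage k, equation (18)."""
    return n1**k * 2 ** (k * (k - 1) // 2)

def q_bound(n1, p0, k):
    """Bound on the probability that a checked group of M_k digits is in error."""
    return m_k(n1, k) * p_bound(n1, p0, k)

def e_bound(n1, p0, k):
    """Equivocation bound: binary entropy of Q_k plus Q_k * M_k (needs 0 < Q_k < 1)."""
    q = q_bound(n1, p0, k)
    entropy = q * log2(1 / q) + (1 - q) * log2(1 / (1 - q))
    return entropy + q * m_k(n1, k)

for k in range(1, 6):
    print(k, m_k(4, k), q_bound(4, 0.05, k), e_bound(4, 0.05, k))
```

Below threshold both bounds eventually fall with $k$, even though the block length $M_k$ grows very quickly.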


Distance Properties 


At the $k$th stage of this coding process, a sequence of $M_k$ binary digits has been selected as a message. Because the check digit values are determined by the information digit values, there are only $2^{F_k M_k}$ possible message sequences, rather than $2^{M_k}$. Any two of these possible messages will have a "distance" from one another, defined as the number of positions in which they have different binary symbols, and the smallest such distance will be $4^k$ for the iterated single-correction, double-detection code. This means that by using this set of codes with a codebook, any set of errors less than one-half of the minimum distance in number can be corrected by choosing as the transmitted message the message point nearest to the received sequence.

It is easy to see that for the coding procedure just described this error-correction capability will not be realized. Any set of $2^k$ errors which are at the corners of a $k$-dimensional cube in the $k$-dimensional rectangle of symbol positions will not be corrected by this process, since each check will merely indicate a double error which it cannot correct. By inspecting any two of the sets of check digits at once, these errors could be located, but they will not have been corrected by the process as described above. The effective minimum of the maximum number of errors which will be corrected is therefore $2^k - 1$, rather than $2^{2k-1} - 1$.
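The corner pattern can be illustrated by counting errors per check group. In the sketch below (illustrative names; the coordinates 1 and 5 are arbitrary choices), a check group of a given order is taken to be a line of positions parallel to one axis of the $k$-dimensional rectangle.

```python
from itertools import product

def corner_errors(k, low=1, high=5):
    """Positions of 2**k errors at the corners of a k-dimensional cube."""
    return set(product((low, high), repeat=k))

def errors_per_check_group(errors, axis):
    """Count the errors falling in each line of positions parallel to `axis`."""
    counts = {}
    for pos in errors:
        key = pos[:axis] + pos[axis + 1:]  # the line is identified by the other coordinates
        counts[key] = counts.get(key, 0) + 1
    return counts

k = 3
errors = corner_errors(k)
for axis in range(k):
    print(axis, sorted(errors_per_check_group(errors, axis).values()))
```

Every affected check group, at every order, contains exactly two errors, so each check reports a double error and leaves the digits unchanged.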


This shows a loss of error-correction capability because of the strictly sequential use of the 
checking information. Without going to the extreme memory requirement of codebook techniques, a 
portion of this loss may be recouped by not throwing the low-order check digits away but using them to 
recheck after higher order checking has been done. This does not increase the maximum number of 
errors for which correction is always guaranteed, but it does reduce the average error probability at 
each stage; the exact amount of this reduction is, unfortunately, difficult to compute. This behavior, 
however, points up a significant feature of the coding process. If the maximum number of errors for 
which correction is always guaranteed were the maximum number of errors for which correction was 
ever guaranteed, the procedure could not transmit information at a nonzero rate; that is, the minimum 
distance properties of the code are inadequate for the job. It is average error-correction capability 
that makes transmission at a nonzero rate possible. 


The Poisson Limit 


Much of the above analysis has assumed that $N_j = 2^{j-1} N_1$, and part of it has further assumed that $N_1 = 2^n$. However, any series of $N_j$ which increases rapidly enough so that $\sigma$ is finite will lead to a coding process that is error-free for sufficiently small values of $N_1 p_0$. In particular, any other approximately geometric series may be used, for which

$$N_j \approx b^{\,j-1} N_1. \qquad (21)$$


The approximation is necessary if $b$ is not an integer. The expression for $p_k$ analogous to (10) is then

$$p_k < \frac{1}{N_1}\,(b N_1 p_0)^{2^k}\, b^{-(k+1)}, \qquad (22)$$

with a threshold at $N_1 p_0 = 1/b$. The value of $\sigma$ can also be bounded for this series. At threshold, the bound corresponding to (15) is


o< HI] LO Bl saya o : (23) 


Again, for $N_1 p_0$ below threshold, $Q_k$ and $E_k$ approach zero as $k$ increases.
For very small $p_0$, the value of $b$ that minimizes $\sigma$ is $b = 2$. This leads to the maximum value of $F_\infty$ given by (16). However, for very small $p_0$, $N_1$ may be made very large. The distribution of errors in the blocks then approaches the Poisson distribution, for which the probability that just $i$ errors have occurred in a block is

$$P(i) = e^{-N_1 p_0}\, \frac{(N_1 p_0)^i}{i!}. \qquad (24)$$


This equation may be used to derive an iterative inequality on the mean number of errors per block after single-correction, double-detection coding.

$$N_j p_j \le e^{-N_j p_{j-1}} \left[ 2\,\frac{(N_j p_{j-1})^2}{2!} + 4\,\frac{(N_j p_{j-1})^3}{3!} + 4\,\frac{(N_j p_{j-1})^4}{4!} + \cdots \right]$$

$$\le N_j p_{j-1}\left(1 - 2 e^{-N_j p_{j-1}}\right) + \tfrac{1}{2}\left(1 - e^{-2 N_j p_{j-1}}\right). \qquad (25)$$


Keeping $N_j p_{j-1}$ constant gives the geometric series (21) for $N_j$. A joint selection of $N_1 p_0$ and $b$ for the minimization of the bound on $\sigma$ gives $N_1 p_0 \approx 0.75$, $b \approx 1.75$, and an effective channel capacity

$$F_\infty \ge 1 - 3.11E, \qquad (26)$$

where $E$ is the equivocation of the binary channel. This is an improvement over (16).
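The quoted values can be cross-checked under the Poisson approximation. The sketch below assumes, as in the binomial case, that even error counts are left unchanged and odd counts of three or more grow by one; the closed form used is derived here from that assumption rather than quoted from the paper. With a mean of $\lambda$ errors per block before checking, the ratio of $\lambda$ to the mean after checking is the growth factor $b$ that keeps $\lambda$ constant from stage to stage.

```python
from math import exp, factorial, isclose

def mean_after_check_series(lam, terms=60):
    """Sum of i*P(i) over even i >= 2 plus (i+1)*P(i) over odd i >= 3, Poisson P(i)."""
    total = 0.0
    for i in range(2, terms):
        weight = i if i % 2 == 0 else i + 1
        total += weight * exp(-lam) * lam**i / factorial(i)
    return total

def mean_after_check_closed(lam):
    """Closed form of the same sum: lam (1 - 2 e^-lam) + (1 - e^-2lam) / 2."""
    return lam * (1 - 2 * exp(-lam)) + 0.5 * (1 - exp(-2 * lam))

lam = 0.75
after = mean_after_check_closed(lam)
print(after, mean_after_check_series(lam), lam / after)
```

At $\lambda = 0.75$ the ratio comes out close to 1.75, in agreement with the values of $N_1 p_0$ and $b$ given above.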


Iteration of Other Codes 


The analysis in the preceding sections has dealt only with iteration of the Hamming single-error-correction, double-error-detection code. Other kinds of codes may also be iterated; nor is it necessary
to use the same type of code at each stage in the iterative process. The only requirement is that each 
code be of the check-digit, or systematic, type, so that its check digits may be computed on the basis of 
the preceding information digits and added on to the message. 

First, the final parity check digit of a Hamming code may be omitted, destroying the double-detection feature of the code. This leads to the inequality

$$p_j \le \tfrac{3}{2}\,(N_j - 1)\, p_{j-1}^2 < \tfrac{3}{2}\, N_j\, p_{j-1}^2 \qquad (27)$$

in place of (5). Iterating this code alone gives a bound on $\sigma$ that is only slightly smaller than (15), but the threshold becomes $N_1 p_0 = 1/3$ rather than $N_1 p_0 = 1/2$, and the effective channel capacity for small $p_0$ is bounded by

$$F_\infty \ge 1 - 6E, \qquad (28)$$

where $E$ is the equivocation of the binary channel.


Second, the Golay (2) analogs to both kinds of Hamming code may be constructed, for $M$-ary channels, where $M$ is a prime number. If there is a probability $(1 - p_0)$ that any symbol will be received correctly, and if the consecutive errors are statistically independent, the results of the binary case carry over quite directly. The inequalities (5) and (27) still hold for the two kinds of codes, since the errors as a whole are still binomially distributed in blocks. At threshold, inequalities (16) and (28) still hold for the effective channel capacity, where $E$ is now the equivocation of a symmetrical $M$-ary channel; that is, of a channel in which the probability of an error taking any given symbol into any other different symbol is $p_0/(M-1)$. The result (26) for the Poisson limit also applies, with the same interpretation of $E$.

Third, the Reed (4)-Muller (5) codes may be treated as check digit codes, and may be iterated to give an error-proof system. For these codes, the average error-reduction capability is not known; only the minimum distance is known. Certain of the codes, such as the triple-correction quadruple-detection code for blocks of 32 binary symbols, might provide a good starting point for an iteration which proceeds by iteration of Hamming codes. The Golay triple-correction quadruple-detection code for blocks of 24 symbols might be used in the same way. It will take considerable computation to evaluate such mixed iteration schemes.

It is not, at present, profitable to use the Reed-Muller codes for later stages in the iteration. The 
reason is that an efficient triple-correction quadruple-detection code should require about $C = 2 \log N$ check digits for a block of length $N$. The Reed-Muller codes require about $C = 1 + \log N + \tfrac{1}{2} \log N\,(\log N - 1)$ check digits for this purpose. For large $N$, therefore, the effective channel capacity is
reduced by the large number of check digits required. There is a similar inefficiency in the 
Reed-Muller codes with greater error-correction capabilities, which might be removed if the average 
error-correction capabilities of these codes were known. 
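The two check-digit estimates quoted above can be compared directly. This is a small sketch with illustrative function names, using the two expressions exactly as they appear in the text.

```python
from math import log2

def efficient_checks(n):
    """About 2 log N check digits, the figure quoted for an efficient code."""
    return 2 * log2(n)

def reed_muller_checks(n):
    """About 1 + log N + (1/2) log N (log N - 1) check digits, as quoted for Reed-Muller."""
    m = log2(n)
    return 1 + m + 0.5 * m * (m - 1)

for n in (32, 64, 256, 1024):
    print(n, efficient_checks(n), reed_muller_checks(n), reed_muller_checks(n) / n)
```

The Reed-Muller figure grows as the square of $\log N$, so the gap widens with block length, although both counts become a small fraction of $N$.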


Nonrectangular Iteration 


The problem of assuring statistical independence among the $N_k$ digits checked by a $k$th order check, so that the inequality (5) derived on the basis of statistical independence can be used as an iterative inequality, was solved above by what might be called rectangular iteration. Each of the $N_k$ digit positions in a check group is selected from a different sequence of $M_{k-1}$ consecutive symbols. Thus until the $k$th order checking has been carried out, no two of them have been associated by lower order checking procedures in any way. This iteration solves the problem, but it makes $M_k$ a function that grows very rapidly with $k$. When $N_j$ is the geometric series (21), then

$$M_k = N_1^{\,k}\, b^{\frac{1}{2} k (k-1)}. \qquad (29)$$


This means that $p_k$, $Q_k$ and $E_k$ decrease quite rapidly as functions of $k$, but much more slowly as functions of the length of the message $M_k$, or its information content $F_k M_k$.

Roughly speaking, if $H_k = F_k M_k$ is the total number of information digits transmitted at the $k$th stage,

$$p_k \sim A\, e^{-2^{\,a (\log H_k)^{1/2}}}, \qquad A > 0,\ a > 0. \qquad (30)$$

This is a much slower decrease of error probability than Feinstein's result (3), which is

$$p_k \sim B\, e^{-b\, 2^{\log H_k}} = B\, e^{-b H_k}, \qquad B > 0,\ b > 0. \qquad (31)$$


A less stringent requirement on the choice of digits checked in a single group is that no two of them have been together in any lower order check group. This requires that there be at least $N_k$ different groups of order $k - 1$ from which to select digit positions. Thus

$$M_k \ge N_k\, N_{k-1}. \qquad (32)$$

If it is possible to approximate equality in (32), and if the statistical dependence so introduced does not seriously weaken inequality (5), then it might be possible to get the result

$$p_k \sim D\, e^{-2^{\,d \log H_k}}, \qquad D > 0,\ d > 0, \qquad (33)$$

which is closer to Feinstein's result.


Conclusion


From a practical point of view, this coding procedure has much to recommend it. A question of both 
theoretical and practical interest is the extent to which the convenience associated with a computable 
and error-free code is compatible with ideal coding, or the smallest price that must be paid for the con- 
venience if the two are incompatible. No answer to this question is in sight at present. However, the 
existence of the error-free process, despite its lack of ideality, puts the burden of efficient coding on 
the first stage of the coding process. For if a coding process succeeds in reducing the equivocation in 
a received message to some small but positive value E, the remaining errors may always be eliminated 
at a cost of 4E (or 3.11E) in channel capacity: an error-proof termination is available, at a price, to 
take care of the residual errors left by any other error-correcting scheme. 


Acknowledgment 


The iterative approach used in this paper was suggested by a comment of Dr. Victor H. Yngve, of 
the Research Laboratory of Electronics, M.I.T., on the fact that redundancy in language was added at
many different levels, a point that he discusses in reference 6. 


References 


(1) R. W. Hamming, "Error Detecting and Error Correcting Codes," Bell Syst. Tech. J. 29, pp. 147-160 (1950).

(2) M. J. E. Golay, "Notes on Digital Coding," Proc. I.R.E. 37, p. 657 (1949).

(3) A. Feinstein, "Some New Basic Results in Information Theory," these transactions.

(4) I. S. Reed, "A Class of Multiple-Error-Correcting Codes and the Decoding Scheme," Technical Report No. 44, Lincoln Laboratory, M.I.T. (1953). See also the paper under this title in these transactions.

(5) D. E. Muller, "Metric Properties of Boolean Algebra and their Applications to Switching Circuits," Report No. 46, Digital Computer Laboratory, University of Illinois (1953).

(6) V. H. Yngve, "Language as an Error-Correcting Code," pp. 73-74, Quarterly Progress Report, Research Laboratory of Electronics, M.I.T., April 15, 1954.


Fig. 1 - Organization of First- and Second-Order Check Digits.


A CLASS OF MULTIPLE-ERROR-CORRECTING CODES
AND THE DECODING SCHEME


Irving S. Reed 


Lincoln Laboratory - Massachusetts Institute of Technology 
Cambridge, Massachusetts 


I. Introduction 


A procedure for constructing one-error-correcting and two-error-detecting systematic codes was introduced in a recent study by R. W. Hamming (Ref. 1). It is the purpose of this paper to exhibit some examples of n-error-correcting and (n + 1) error-detecting systematic codes for the cases where both the code length and (n + 1) are powers of two. The class of codes to be considered was developed by D. E. Muller in his recent work.


The decoding scheme presented in this paper differs from Hamming's scheme in that the en- 
coded message will be extracted directly from the possibly corrupted received code by a majority 
testing of the redundant relations within the code. Hamming's scheme for n = 1 was dependent first 
on the location of a possible digit error in the code; secondly, on the correction of that digit; 
and lastly, on the extraction of the message from the corrected code. By circumventing Hamming's 
step of error location and correction, which is quite a severe problem when n is not equal to one, 
we have arrived at a decoding scheme that makes a natural use of the redundancy within the code as 
well as being conceptually simple. 


In this paper, some of the mathematical proofs of the methods discussed will be avoided 
for the sake of brevity of exposition. A more detailed mathematical analysis will appear elsewhere. 


II. Some Mathematical Preliminaries 


A code having n binary digits may be considered the element of a space, consisting of $2^n$ elements of the form

$$f = (f_0, f_1, \ldots, f_{n-1}),$$

where $(f_j = 0, 1)$ for $(j = 0, 1, 2, \ldots, n-1)$.


This space is technically an Abelian group if the sum of any two elements $f$ and $g$ in the space is defined as follows:

$$f \oplus g = (f_0, f_1, \ldots, f_{n-1}) \oplus (g_0, g_1, \ldots, g_{n-1}) = (f_0 \oplus g_0,\ f_1 \oplus g_1,\ \ldots,\ f_{n-1} \oplus g_{n-1}),$$

where $f_j \oplus g_j$ is the sum modulo two of the binary digits $f_j$ and $g_j$ for $(j = 0, 1, 2, \ldots, n-1)$. If multiplication by the binary scalar $a$ is allowed as

$$af = a(f_0, f_1, \ldots, f_{n-1}) = (af_0, af_1, \ldots, af_{n-1}),$$

the Abelian group may be termed a generalized vector space of n-dimensions or a module. Finally, if the product operation

$$f \cdot g = (f_0, f_1, \ldots, f_{n-1}) \cdot (g_0, g_1, \ldots, g_{n-1}) = (f_0 g_0,\ f_1 g_1,\ \ldots,\ f_{n-1} g_{n-1})$$

for $f$ and $g$ in the module is introduced, the space is a Boolean ring. The prime operation is defined to be

$$f' = f \oplus I$$

for $f$ in the ring, and where $I$ is the identity vector $(1, 1, 1, \ldots, 1)$.


Into this space one may further introduce a norm or length of a vector as follows:

$$\|f\| = \sum_{i=0}^{n-1} f_i,$$

where $\Sigma$ refers to ordinary addition. It is not difficult to see that the norm of the sum of two elements $f$ and $g$ in the ring, or $\|f \oplus g\|$, is precisely the Hamming distance $D(f, g)$ as defined in Ref. 1.
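The operations of this section translate directly into code. A minimal sketch follows, representing vectors as Python lists of 0 and 1; the function names are illustrative.

```python
def vec_sum(f, g):
    """Coordinate-wise sum modulo two."""
    return [a ^ b for a, b in zip(f, g)]

def vec_product(f, g):
    """Coordinate-wise product, the ring multiplication."""
    return [a & b for a, b in zip(f, g)]

def prime(f):
    """The prime operation: f summed with the identity vector I = (1, 1, ..., 1)."""
    return [a ^ 1 for a in f]

def norm(f):
    """Ordinary (arithmetic) sum of the digits of f."""
    return sum(f)

def hamming_distance(f, g):
    """Number of positions in which f and g differ."""
    return sum(a != b for a, b in zip(f, g))

f = [1, 1, 0, 0, 0, 0, 1, 1]
g = [1, 1, 0, 0, 1, 0, 1, 1]
print(norm(vec_sum(f, g)), hamming_distance(f, g))
```

The two printed numbers agree, which is the statement that the norm of the sum is the Hamming distance.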


Now let $n$, the dimension of the vector space, be a power of two, or $n = 2^m$. Let a vector of this space be of the form

$$f = (f_0, f_1, \ldots, f_{2^m-1}),$$

where $f_j$ is a binary digit for $(j = 0, 1, \ldots, 2^m-1)$. Now the vector $f$ may be clearly expressed as

$$f = f_0 I_0 \oplus f_1 I_1 \oplus \cdots \oplus f_{2^m-1} I_{2^m-1}, \qquad (1)$$

where $I_j$ is a unit vector with the digit one in the $j$-th coordinate of the vector and zeros elsewhere for $(j = 0, 1, \ldots, 2^m-1)$. Further, each unit vector $I_j$ can be determined as a product of $m$ vectors from the set of $2m$ vectors $x_1, x_2, x_3, \ldots, x_m, x_1', x_2', \ldots, x_m'$, where $x_1$ is a vector consisting of alternating zeros and ones, beginning with zero; $x_2$ is a vector consisting of alternating zero pairs and one pairs, beginning with a zero pair, and so forth, as follows:

$$x_1 = (0\;1\;0\;1\;0\;1\;0\;1\;\ldots\;0\;1),$$
$$x_2 = (0\;0\;1\;1\;0\;0\;1\;1\;\ldots\;1\;1),$$
$$x_3 = (0\;0\;0\;0\;1\;1\;1\;1\;\ldots\;1\;1),$$
$$\cdots$$
$$x_m = (0\;0\;0\;0\;0\;0\;0\;0\;\ldots\;1\;1). \qquad (2)$$

If $x_k^{i_k}$ is defined to be $x_k'$ for $i_k = 0$ and $x_k$ for $i_k = 1$, then by the rules of Boolean algebra,

$$I_j = x_1^{i_1} x_2^{i_2} \cdots x_m^{i_m}, \qquad (3)$$

where

$$j = \sum_{k=1}^{m} i_k 2^{k-1} \quad \text{with } (i_k = 0, 1) \text{ for } (k = 1, 2, \ldots, m).$$
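The construction of the unit vectors from the $x_k$ and their primes can be verified by direct computation. In this sketch (illustrative names), coordinate $j$ of $x_k$ is taken to be the digit of weight $2^{k-1}$ in the binary representation of $j$, which reproduces the alternating patterns of (2).

```python
def x_vector(k, m):
    """x_k for k = 1..m: runs of 2**(k-1) zeros and ones, beginning with zeros."""
    return [(j >> (k - 1)) & 1 for j in range(2**m)]

def unit_vector_as_product(j, m):
    """I_j as the product of x_k (if i_k = 1) or its prime (if i_k = 0), k = 1..m."""
    result = [1] * 2**m
    for k in range(1, m + 1):
        xk = x_vector(k, m)
        i_k = (j >> (k - 1)) & 1
        factor = xk if i_k == 1 else [b ^ 1 for b in xk]
        result = [a & b for a, b in zip(result, factor)]
    return result

m = 3
for k in range(1, m + 1):
    print(f"x_{k} =", x_vector(k, m))
print("I_5 =", unit_vector_as_product(5, m))
```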


Combining Eqs. (1) and (3), we have

$$f = \sum_{j=0}^{2^m-1} f_j\, x_1^{i_1} x_2^{i_2} \cdots x_m^{i_m}, \qquad (4)$$

where $i_1, i_2, \ldots, i_m$ are the digits of the binary representation of $j$, and where the summation sign $\Sigma$ is with respect to the sum operation $\oplus$. Equation (4) is the canonical expansion of any vector $f$ in the Boolean algebra of $2^m$ dimensional vectors, consisting of binary digits.


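If the identity $x_k' = I \oplus x_k$ and the distributive law of algebra is used, Eq. (4) may be expanded to obtain the following polynomial in the $x_k$'s:

$$f = g_0 I \oplus g_1 x_1 \oplus \cdots \oplus g_m x_m \oplus g_{12}\, x_1 x_2 \oplus \cdots \oplus g_{m-1,m}\, x_{m-1} x_m \oplus \cdots \oplus g_{12 \ldots m}\, x_1 x_2 \cdots x_m. \qquad (5)$$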


Equation (5) can be written more explicitly as

$$f = f(0 \ldots 0)\,I \oplus \Delta_1 f(0 \ldots 0)\, x_1 \oplus \Delta_2 f(0 \ldots 0)\, x_2 \oplus \cdots \oplus \Delta_m f(0 \ldots 0)\, x_m \oplus \Delta^2_{12} f(0 \ldots 0)\, x_1 x_2 \oplus \cdots \oplus \Delta^m_{12 \ldots m} f(0 \ldots 0)\, x_1 x_2 \cdots x_m, \qquad (6)$$

where

$$f(i_1, \ldots, i_m) = f_j \quad \text{when } j = \sum_{k=1}^{m} i_k 2^{k-1} \text{ for } i_k = 0, 1,$$

and the $\Delta$'s are multiple partial differences, for example,

$$\Delta_1 f(0 \ldots 0) = f(1, 0, 0, \ldots, 0) \oplus f(0, 0, 0, \ldots, 0),$$

$$\Delta^2_{12} f(0 \ldots 0) = [f(1, 1, 0, \ldots) \oplus f(0, 1, 0, \ldots)] \oplus [f(1, 0, 0, \ldots) \oplus f(0, 0, 0, \ldots)],$$

and so forth. The polynomial representation in Eq. (6) of the vector $f$ supplies the relations between the coefficients of Eq. (5) and the scalars $f_j$ of Eq. (4) for $(j = 0, 1, 2, \ldots, 2^m-1)$. This definition of the $\Delta$'s will be expanded in another Section of this paper.
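The multiple partial differences can be computed for all coefficients at once. The sketch below is one standard way to do this, not a method taken from the paper: it indexes each coefficient by an integer whose binary digits mark which of $x_1, \ldots, x_m$ occur in the product (so index 1 is $x_1$, index 2 is $x_2$, index 3 is $x_1 x_2$, and so on). The function name is illustrative.

```python
def polynomial_coefficients(f):
    """Coefficients of the polynomial form: each is the modulo-two sum of f over a subcube."""
    g = list(f)
    m = len(f).bit_length() - 1  # len(f) is assumed to be 2**m
    for k in range(m):
        bit = 1 << k
        for j in range(len(g)):
            if j & bit:
                g[j] ^= g[j ^ bit]  # partial difference with respect to x_(k+1)
    return g

f = [1, 1, 0, 0, 0, 0, 1, 1]
g = polynomial_coefficients(f)
print(g)  # indices 0, 1, 2, 4 hold the coefficients of I, x_1, x_2, x_3
```

For this vector only the coefficients of $I$, $x_2$ and $x_3$ are nonzero, so it is a polynomial of first degree. Applying the same function to the coefficient list returns the original digits, since the transformation is its own inverse.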


III. The Generation of the Multiple Error Allowing Codes


Suppose that the dimension of the space considered in the previous section is $2^m$. Consider the set $\mathcal{A}_r^m$ of all polynomials of the form (5) of degree less than or equal to $r$, where $r \le m$. Each such polynomial must have the form

$$g_0 I \oplus g_1 x_1 \oplus \cdots \oplus g_m x_m \oplus g_{12}\, x_1 x_2 \oplus \cdots \oplus g_{m-r+1, \ldots, m}\, x_{m-r+1} \cdots x_m, \qquad (7)$$

and the sum of any two such polynomials is a member of the same set. This implies that $\mathcal{A}_r^m$, the set of all polynomials of type (7) or of degree less than or equal to $r$, forms an Abelian group or submodule of the Boolean ring of $2^m$ dimensional vectors. Since $\mathcal{A}_r^m$ is a module, the Hamming distance between any two elements of $\mathcal{A}_r^m$ is the norm of a third element of $\mathcal{A}_r^m$. This fact was exploited by D. E. Muller in proving his Theorem 25. Muller's Theorem 25, in our terminology, may be expressed as follows:


Theorem A: The norms of all non-zero vectors $f$ of $\mathcal{A}_r^m$ satisfy

$$\|f\| \ge 2^{m-r} \quad \text{for } (m = 0, 1, 2, \ldots) \text{ and } r \le m.$$

We shall not prove this theorem here. It suffices to say that Muller proved the theorem by an induction on $m$ and $r$ and the properties of the Hamming distance.

By the above theorem there is at least a distance $2^{m-r}$ between two elements of $\mathcal{A}_r^m$ and, as a consequence, there is an open Hamming sphere of radius $2^{m-r-1}$ about each element of $\mathcal{A}_r^m$ in the whole vector space which does not intersect any other such sphere. This means that it is possible to associate each element of such a sphere with the element defining the sphere or, what is the same, to associate an element of the whole vector space which is less than a distance $2^{m-r-1}$ from an element $f$ of $\mathcal{A}_r^m$ with $f$.


In order to illustrate how a message may be coded into an error-detecting code of the type described above, consider the following example: Let $m = 4$ and $r = 1$; by (7) the vectors of $\mathcal{A}_1^4$ are of the form

$$g_0 I \oplus g_1 x_1 \oplus g_2 x_2 \oplus g_3 x_3 \oplus g_4 x_4. \qquad (8)$$

Let the message consist of the five binary digits $(g_0, g_1, g_2, g_3, g_4)$. The code space $\mathcal{A}_1^4$ may be regarded as generated by the four vectors $x_1, x_2, x_3, x_4$ and the identity vector $I$, which may be written explicitly as follows:

$$x_1 = (0\;1\;0\;1\;0\;1\;0\;1\;0\;1\;0\;1\;0\;1\;0\;1),$$
$$x_2 = (0\;0\;1\;1\;0\;0\;1\;1\;0\;0\;1\;1\;0\;0\;1\;1),$$
$$x_3 = (0\;0\;0\;0\;1\;1\;1\;1\;0\;0\;0\;0\;1\;1\;1\;1),$$
$$x_4 = (0\;0\;0\;0\;0\;0\;0\;0\;1\;1\;1\;1\;1\;1\;1\;1),$$
$$I = (1\;1\;1\;1\;1\;1\;1\;1\;1\;1\;1\;1\;1\;1\;1\;1). \qquad (9)$$

The 32 vector codes of $\mathcal{A}_1^4$ can be obtained by scalar multiplication of the vectors of (9) by the message digits $g_0, g_1, g_2, g_3, g_4$ in accordance with (8). For example, the message (0 1 1 0 0) has the code vector $x_1 \oplus x_2$ or

$$(0\;1\;1\;0\;0\;1\;1\;0\;0\;1\;1\;0\;0\;1\;1\;0).$$

Each of the 32 codes will be a distance of at least eight from each other.


In order to practically generate the above code, one should note that the vector $x_1$ is the sequence of digits generated by the least significant binary stage $B_1$ of a binary counter of scale sixteen; $x_2$ is obtained from the second stage $B_2$; $x_3$ from the third stage $B_3$; and $x_4$ from the final stage $B_4$, as the counter goes through one period of its operation. If the message $(g_0, g_1, g_2, g_3, g_4)$ is stored in a binary register with stages $A_0, A_1, A_2, A_3, A_4$, then the switching function

$$C = A_0 \oplus A_1 B_1 \oplus A_2 B_2 \oplus A_3 B_3 \oplus A_4 B_4$$

will generate the code sequentially during one period of operation of the binary counter.
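The counter-and-register arrangement can be simulated in a few lines. This is a sketch under the stated description: a loop variable plays the role of the scale-sixteen counter, its binary digits stand for the stages $B_1$ to $B_4$, and the function name is illustrative.

```python
from itertools import product

def encode(message):
    """message = (g0, g1, g2, g3, g4), held in register stages A0..A4."""
    a0, a1, a2, a3, a4 = message
    code = []
    for t in range(16):  # one period of the scale-sixteen counter
        b1, b2, b3, b4 = t & 1, (t >> 1) & 1, (t >> 2) & 1, (t >> 3) & 1
        code.append(a0 ^ (a1 & b1) ^ (a2 & b2) ^ (a3 & b3) ^ (a4 & b4))
    return code

print(encode((0, 1, 1, 0, 0)))

# Smallest distance between any two of the 32 code vectors
words = [encode(msg) for msg in product((0, 1), repeat=5)]
print(min(sum(a != b for a, b in zip(u, v))
          for i, u in enumerate(words) for v in words[i + 1:]))
```

The first line printed is the code vector for the message (0 1 1 0 0) given in the example above, and the second confirms the minimum distance of eight.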


If one of the above codes of $\mathcal{A}_1^4$ is corrupted during transmission so that no more than three errors are made, it is evidently possible by the previous discussion of this section to somehow extract the original message from the corrupted received code. The method by which this extraction may be accomplished will be shown by example in the next section and in general in the last section. It should be clear from the above example how the vectors of $\mathcal{A}_r^m$ may be generated for arbitrary $r$ and $m$ where $r \le m$.


IV. Decoding Corrupted Codes of $\mathcal{A}_r^m$ by a Majority Testing of Redundancy Relations


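Let us first consider the coding space $\mathcal{A}_1^3$. By (7), the vector of this space has the form

$$g_0 I \oplus g_1 x_1 \oplus g_2 x_2 \oplus g_3 x_3. \qquad (10)$$

The message will consist of the four binary digits $(g_0, g_1, g_2, g_3)$, and the generating vectors of the space are

$$x_1 = (0\;1\;0\;1\;0\;1\;0\;1),$$
$$x_2 = (0\;0\;1\;1\;0\;0\;1\;1),$$
$$x_3 = (0\;0\;0\;0\;1\;1\;1\;1),$$
$$I = (1\;1\;1\;1\;1\;1\;1\;1). \qquad (11)$$

By (6) we have the following set of relations for the message digits $g_j$ in terms of $f_j$, the code digits:

$$g_0 = f(0 \ldots 0) = f_0, \qquad \Delta^2_{12} f(0 \ldots 0) = f_0 \oplus f_1 \oplus f_2 \oplus f_3 = 0,$$
$$g_1 = \Delta_1 f(0 \ldots 0) = f_0 \oplus f_1, \qquad \Delta^2_{13} f(0 \ldots 0) = f_0 \oplus f_1 \oplus f_4 \oplus f_5 = 0,$$
$$g_2 = \Delta_2 f(0 \ldots 0) = f_0 \oplus f_2, \qquad \Delta^2_{23} f(0 \ldots 0) = f_0 \oplus f_2 \oplus f_4 \oplus f_6 = 0,$$
$$g_3 = \Delta_3 f(0 \ldots 0) = f_0 \oplus f_4, \qquad \Delta^3_{123} f(0 \ldots 0) = f_0 \oplus f_1 \oplus \cdots \oplus f_7 = 0. \qquad (12)$$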


By (12) there are four relations which $g_1$ satisfies,

$$g_1 = f_0 \oplus f_1 = f_2 \oplus f_3 = f_4 \oplus f_5 = f_2 \oplus f_3 \oplus f_4 \oplus f_5 \oplus f_6 \oplus f_7.$$

By substituting the second and third relations into the fourth relation, we have

$$g_1 = g_1 \oplus g_1 \oplus f_6 \oplus f_7 = 0 \oplus f_6 \oplus f_7 = f_6 \oplus f_7.$$

Thus we obtain the four independent and disjoint relations for $g_1$,

$$g_1 = f_0 \oplus f_1 = f_2 \oplus f_3 = f_4 \oplus f_5 = f_6 \oplus f_7.$$

These four relations are disjoint in the sense that no two of the relations have variables in common. In a similar manner, we may obtain four independent and disjoint relations for both $g_2$ and $g_3$ so that $g_1, g_2, g_3$ may be expressed as

$$g_1 = f_0 \oplus f_1 = f_2 \oplus f_3 = f_4 \oplus f_5 = f_6 \oplus f_7,$$
$$g_2 = f_0 \oplus f_2 = f_1 \oplus f_3 = f_4 \oplus f_6 = f_5 \oplus f_7,$$
$$g_3 = f_0 \oplus f_4 = f_1 \oplus f_5 = f_2 \oplus f_6 = f_3 \oplus f_7.$$


Let us now suppose that the received code is the vector $(f_0, f_1, \ldots, f_7)$. If there were no error in transmission of the code, all of the above relations would hold. If there were one error, three out of four of the relations would hold. If there were two errors, at least two of the $g_j$'s would have two out of four incorrect relations. Then $g_1, g_2, g_3$ may be determined uniquely if one or no error occurred during transmission, and two errors may always be detected by making a majority test on the arithmetic sum of the values of the four relations for each $g_j$ $(j = 1, 2, 3)$. In order to state this criterion more explicitly, let the values of the four relations for $g_j$ be denoted by $r_{j1}, r_{j2}, r_{j3}, r_{j4}$ for $(j = 1, 2, 3)$, and let $S_j$ be the arithmetic sum of $r_{j1}, r_{j2}, r_{j3}, r_{j4}$, or

$$S_j = \sum_{i=1}^{4} r_{ji}.$$

Then the majority decision test for $g_j$ is

$$g_j = 0 \quad \text{if } 0 \le S_j < 2,$$
$$g_j \text{ is indeterminate} \quad \text{if } S_j = 2,$$
$$g_j = 1 \quad \text{if } 2 < S_j \le 4, \qquad \text{for } (j = 1, 2, 3). \qquad (13)$$


With the assumption that the received code is no more than two digits in error, the majority 
test (13) will determine g,,g, »&3 uniquely for only one or no errors, and reject the code as meaning 
less in the case of two errors. ~In the case of one error or less, 8 98983 may be assumed now to be 
determined; it remains to determine gg. In order to find go, note that if, as 198283 are found, the 
vectors 8X, s85%) 283%, are added successively to the received vector, by (10) we will énd with either 
the vector goI in the case of no error or with a vector of distance one from ole Thus to detect By 
the following majority decision test will suffice: 


$g_0 = 0$ if $\sum_{i=0}^{7} m_i < 4$ ,

$g_0 = 1$ if $\sum_{i=0}^{7} m_i > 4$ ,        (14)

where $m_i$ are the digits of the code after extraction of digits $g_1, g_2, g_3$ in accordance with
the above procedure.


The above method of decoding may be illustrated by the following example: Suppose that the
message sent was (1 0 1 1), and that during transmission an error was made in the fifth digit of the
original code (1 1 0 0 0 0 1 1) so that the received code had the form (1 1 0 0 1 0 1 1). We first
test for $g_1, g_2, g_3$ by (12) and find $g_1 = 0$, $g_2 = 1$ and $g_3 = 1$. Using (11), we add
$g_1 x_1 \oplus g_2 x_2 \oplus g_3 x_3$ to the code, obtaining

$0\,(0\,1\,0\,1\,0\,1\,0\,1) \oplus (0\,0\,1\,1\,0\,0\,1\,1) \oplus (0\,0\,0\,0\,1\,1\,1\,1) \oplus (1\,1\,0\,0\,1\,0\,1\,1)$
$= (1\,1\,1\,1\,0\,1\,1\,1) = (m_0, m_1, \dots, m_7)$ .

Finally, by (14),

$g_0 = 1$ since $\sum_{i=0}^{7} m_i = 7 > 4$ .
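
The worked example can be checked mechanically. The following is a minimal sketch in Python, not part of the original paper; the names `RELATIONS`, `majority` and `decode` are illustrative, and the sketch assumes digit $f_0$ is written leftmost and that a tie in the test for $g_0$ is also treated as indeterminate.

```python
# Generating vectors of the 8-digit code, digit f_0 on the left.
X1 = [0, 1, 0, 1, 0, 1, 0, 1]
X2 = [0, 0, 1, 1, 0, 0, 1, 1]
X3 = [0, 0, 0, 0, 1, 1, 1, 1]

# Index pairs (a, b) of the relations f_a xor f_b in (12), keyed by j.
RELATIONS = {
    1: [(0, 1), (2, 3), (4, 5), (6, 7)],
    2: [(0, 2), (1, 3), (4, 6), (5, 7)],
    3: [(0, 4), (1, 5), (2, 6), (3, 7)],
}


def majority(total, count):
    """Return 0 or 1 by majority over `count` relations, None on a tie."""
    if 2 * total < count:
        return 0
    if 2 * total > count:
        return 1
    return None


def decode(received):
    """Return (g0, g1, g2, g3), or None if any test is indeterminate."""
    g = {}
    for j, pairs in RELATIONS.items():
        g[j] = majority(sum(received[a] ^ received[b] for a, b in pairs), 4)
        if g[j] is None:
            return None
    # Add g1*x1 + g2*x2 + g3*x3 (mod 2) to the received vector.
    residual = [f ^ (g[1] & a) ^ (g[2] & b) ^ (g[3] & c)
                for f, a, b, c in zip(received, X1, X2, X3)]
    g0 = majority(sum(residual), 8)
    if g0 is None:
        return None
    return (g0, g[1], g[2], g[3])


print(decode([1, 1, 0, 0, 1, 0, 1, 1]))  # received code of the example
```

A received code with two errors, such as (1 1 0 0 1 1 1 1), makes the test for $g_2$ indeterminate, so the sketch returns `None` in that case.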


Although this code is none other than an example of the set of one-error-correcting and
two-error-detecting codes of the type described by Hamming in Ref. 1, the method of decoding
considered above is different. Our procedure of decoding is advantageous in that it may be
generalized in a natural way to include any of the coding spaces of the second section of this
paper. Before we consider the generalization by further examples, let us note a tabular way of
representing the redundancy relations.


If the digits or variables of each relation are connected by lines for each of the vectors

$x_1 = (0\ 1\ 0\ 1\ 0\ 1\ 0\ 1)$ ,
$x_2 = (0\ 0\ 1\ 1\ 0\ 0\ 1\ 1)$ ,
$x_3 = (0\ 0\ 0\ 0\ 1\ 1\ 1\ 1)$ ,        (15)

the relations of (12) become almost self-evident by their simplicity with respect to order and
symmetry. This simplicity makes it possible to discover the redundancy relations for more general
spaces without resorting to the algebraic approach used above.


As a second example of our decoding procedure, consider the coding space of 16-digit vectors
introduced in the latter part of the preceding section. Each vector of this space has the form of
(8), where the generating vectors are $x_1, x_2, x_3, x_4$ and $I$ of (9). The first-degree
redundancy relations may be determined in a manner similar to the above example and represented in
a tabular manner similar to (15) as follows:




$x_1 = (0\ 1\ 0\ 1\ 0\ 1\ 0\ 1\ 0\ 1\ 0\ 1\ 0\ 1\ 0\ 1)$ ,
$x_2 = (0\ 0\ 1\ 1\ 0\ 0\ 1\ 1\ 0\ 0\ 1\ 1\ 0\ 0\ 1\ 1)$ ,
$x_3 = (0\ 0\ 0\ 0\ 1\ 1\ 1\ 1\ 0\ 0\ 0\ 0\ 1\ 1\ 1\ 1)$ ,
$x_4 = (0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 1\ 1\ 1\ 1\ 1\ 1\ 1\ 1)$ .        (16)


For instance, the eight independent and disjoint relations for $g_1$ are

$g_1 = f_{2i} \oplus f_{2i+1}$ for $(i = 0,1,\dots,7)$ .


If the eight values of the redundancy relations for $g_j$ are labeled $r_{j1}, r_{j2}, \dots, r_{j8}$
for $(j = 1,2,3,4)$, and $S_j$ is defined by

$S_j = \sum_{i=1}^{8} r_{ji}$ ,

then, by an argument similar to that used in the previous example, the majority decision test for
$g_j$ is as follows:

$g_j = 0$ if $0 \le S_j < 4$ ,
$g_j$ is indeterminate if $S_j = 4$ ,
$g_j = 1$ if $4 < S_j \le 8$ , for $(j = 1,2,3,4)$ .        (17)


In order to determine $g_0$ we first add the determined vectors $g_j x_j$ to the received message,
assuming, of course, that no $g_j$ is indeterminate, and we are left with the zero-degree polynomial
$g_0 I$, possibly corrupted by errors. If there had been no errors, there would be sixteen
zero-degree relations which $g_0$ satisfies, or

$g_0 = m_j$ for $(j = 0,1,2,\dots,15)$ ,

where, as in (14), $m_j$ are the digits of the code after extraction of $g_1, g_2, g_3$ and $g_4$.
Thus $g_0$ is determined by the majority decision test

$g_0 = 0$ if $\sum_{i=0}^{15} m_i < 8$ ,

$g_0 = 1$ if $\sum_{i=0}^{15} m_i > 8$ .        (18)
4=0 


For the above example three errors may be made in the code and the correct message is still
obtained. If four errors are made, some of the message digits are indeterminate. It is of some
interest to note that, for some cases of five errors in the code, the message may be extracted
correctly. For example, suppose that the message was (0 0 0 0 0) and that the received code was
(1 1 0 0 1 0 1 0 1 0 0 0 0 0 0 0). Clearly, the correct message will be extracted from this code by
the above procedure.
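
The five-error case can be confirmed with a short sketch of tests (17) and (18). This is an illustration in Python, not the author's procedure as printed: the function name `first_order_decode` is hypothetical, $x_j$ is taken to be the binary digit of weight $2^{j-1}$ of the position index, and a tie in the test for $g_0$ is treated as indeterminate.

```python
def first_order_decode(received, m):
    """Majority-decide g_1..g_m and then g_0 for a 2**m-digit first-degree code.

    Returns [g0, g1, ..., gm], or None if any majority test is a tie.
    """
    n = 1 << m
    coeffs = []
    for j in range(1, m + 1):
        step = 1 << (j - 1)
        # One relation f_i xor f_{i+step} for every i whose j-th index bit is 0.
        s = sum(received[i] ^ received[i + step]
                for i in range(n) if not i & step)
        if 2 * s == n // 2:
            return None
        coeffs.append(1 if 2 * s > n // 2 else 0)
    # Add the determined vectors g_j x_j to the received vector.
    residual = list(received)
    for j, gj in enumerate(coeffs, start=1):
        if gj:
            for i in range(n):
                residual[i] ^= (i >> (j - 1)) & 1
    total = sum(residual)
    if 2 * total == n:
        return None
    return [1 if 2 * total > n else 0] + coeffs


received = [1, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0]
print(first_order_decode(received, 4))
```

For this received code each of the four sums $S_j$ equals 3, and the residual code contains five ones, so every test falls on the side of zero.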


As a final example of coding and decoding scheme, consider the second-degree coding space of
16-digit vectors. This space is generated by $x_1, x_2, x_3, x_4$ of (16) and $I$, as well as the
quadratic variables $x_1 x_2, x_1 x_3, x_1 x_4, x_2 x_3, x_2 x_4, x_3 x_4$. The latter six vectors
may be presented in the following tabular manner.

$x_1 x_2 = (0\ 0\ 0\ 1\ 0\ 0\ 0\ 1\ 0\ 0\ 0\ 1\ 0\ 0\ 0\ 1)$ ,
$x_1 x_3 = (0\ 0\ 0\ 0\ 0\ 1\ 0\ 1\ 0\ 0\ 0\ 0\ 0\ 1\ 0\ 1)$ ,
$x_1 x_4 = (0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 1\ 0\ 1\ 0\ 1\ 0\ 1)$ ,
$x_2 x_3 = (0\ 0\ 0\ 0\ 0\ 0\ 1\ 1\ 0\ 0\ 0\ 0\ 0\ 0\ 1\ 1)$ ,
$x_2 x_4 = (0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 1\ 1\ 0\ 0\ 1\ 1)$ ,
$x_3 x_4 = (0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 1\ 1\ 1\ 1)$ .        (19)


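The messages for this example will be 11 binary digit numbers of the form
$(g_0, g_1, g_2, g_3, g_4, g_{12}, g_{13}, g_{14}, g_{23}, g_{24}, g_{34})$. Each code will be sent
as a vector of the form

$(f_0, f_1, \dots, f_{15}) = g_0 I \oplus g_1 x_1 \oplus g_2 x_2 \oplus g_3 x_3 \oplus g_4 x_4 \oplus g_{12} x_1 x_2 \oplus g_{13} x_1 x_3 \oplus g_{14} x_1 x_4$
$\oplus\ g_{23} x_2 x_3 \oplus g_{24} x_2 x_4 \oplus g_{34} x_3 x_4$ .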


The second-degree coefficients $g_{ij}$ of the received message are extracted first with a
majority decision based on the redundancy relations illustrated in (19). Next, assuming that no
indeterminacy occurred in the second-degree coefficients, the vectors $g_{ij} x_i x_j$ are added to
the received code, after which we are left with a residual code from which the first-degree
coefficients $g_i$ may be determined by test (17). Finally, $g_0$ may be determined by test (18)
after adding the vectors $g_1 x_1, g_2 x_2, g_3 x_3, g_4 x_4$ to the residual code.


This example illustrates the general principle of decoding the particular class of codes 
under consideration. The highest degree coefficients of a received code are extracted first; then 
these terms of the polynomial are subtracted out of the code, thereby leaving a residual code of the 
next lower degree than the original code in the special case of no errors. The operation is repeated 
over and over on the successive residual codes until either an indeterminacy occurs or until $g_0$ is
extracted.


The relations of (19) illustrate the fact that there are four redundancy relations each of
four variables for the second-degree coefficients $g_{ij}$. For example, the redundancy relations for
$g_{12}$ are

$g_{12} = f_{4i} \oplus f_{4i+1} \oplus f_{4i+2} \oplus f_{4i+3}$ for $(i = 0,1,2,3)$ .        (20)


In general, these relations will allow only one error; two errors will lead to indeterminacy. This 
is another example of Hamming's one-error-correction and two-error-detection codes. 
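
Relation (20) can be verified numerically. The sketch below, in Python, builds a 16-digit code from a chosen set of coefficients and evaluates the four relations for $g_{12}$; the helper names `bit`, `encode` and `g12_relations` are illustrative, and $x_j$ is assumed to be the binary digit of weight $2^{j-1}$ of the position index, as in (16).

```python
def bit(i, j):
    """Value of x_j at digit position i."""
    return (i >> (j - 1)) & 1


def encode(coeffs):
    """Build the 16-digit code from a dict such as {(): 1, (2,): 1, (1, 2): 1}.

    Each key lists the variables of one term; () stands for the vector I.
    """
    word = []
    for i in range(16):
        digit = 0
        for variables, g in coeffs.items():
            term = g
            for j in variables:
                term &= bit(i, j)
            digit ^= term
        word.append(digit)
    return word


def g12_relations(f):
    """Values of the four relations (20), one per block of four digits."""
    return [f[4 * i] ^ f[4 * i + 1] ^ f[4 * i + 2] ^ f[4 * i + 3]
            for i in range(4)]


code = encode({(): 1, (2,): 1, (1, 2): 1, (3, 4): 1})
print(g12_relations(code))      # all four relations agree on g_12

code[5] ^= 1                    # a single error spoils exactly one relation
print(g12_relations(code))
```

Every term other than $g_{12} x_1 x_2$ contributes an even number of ones to each block of four digits, which is why the four block sums all equal $g_{12}$ in the absence of errors.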


It should be noted that the majority decision tests used in the above examples were, in
general, overdeterminate. For instance, in the first example, if one error had been made, no more
than one error would remain in the residual code after determining $g_1, g_2, g_3$. On the other hand,
if two errors had occurred, the process of extraction would have ended before $g_0$ could be
determined. Thus a test of only the following type would be necessary:

$g_0 = 0$ if $m_{i_1} + m_{i_2} + m_{i_3} \le 1$ ,
$g_0 = 1$ if $m_{i_1} + m_{i_2} + m_{i_3} \ge 2$ ,

where $i_1, i_2, i_3$ are any three distinct numbers between zero and seven, inclusive. Refinements
such as this, however, do not destroy the validity of the previous tests.




V. THE GENERAL DECODING PRINCIPLE 


To study the general decoding scheme, illustrated by example in Section IV, it will be
necessary to consider the general multinomial expansion formula (6) more carefully. Let us first
define the multiple differences used in (6) in more detail.


As in (6), $f(i_1, \dots, i_m)$ is defined as

$f(i_1, \dots, i_m) = f_j$ where $j = \sum_{k=1}^{m} i_k 2^{k-1}$ , for $(i_k = 0,1)$ .        (21)


The general multiple partial difference

$\Delta^{p}_{k_1 k_2 \dots k_p} f(i_1, i_2, \dots, i_m)$

is defined inductively as

$\Delta_{k} f(i_1, \dots, i_m) = f(i_1, \dots, i_{k-1}, i_k \oplus 1, i_{k+1}, \dots, i_m) \oplus f(i_1, \dots, i_k, \dots, i_m)$ ,

$\Delta^{p}_{k_1 k_2 \dots k_p} f(i_1, \dots, i_m) = \Delta^{p-1}_{k_1 \dots k_{p-1}} f(i_1, \dots, i_{k_p} \oplus 1, \dots, i_m) \oplus \Delta^{p-1}_{k_1 \dots k_{p-1}} f(i_1, \dots, i_m)$ .        (22)


With these definitions it is possible to prove by induction the validity and uniqueness of expansion
(6) for any Boolean algebra of m variables, and in particular, for the Boolean algebra of
$2^m$-dimensional vectors as described in Section II.


One evident consequence of (21) is the identity

$f(i_1, \dots, i_{k-1}, i_k \oplus 1, i_{k+1}, \dots, i_m) = f_{j + (-1)^{i_k} 2^{k-1}}$ .        (23)


By the use of (23) it is possible to write (22) explicitly in terms of the $f_j$ as


gtd) =f, O Lacy Sit 


and 
A cif Ci Aree hp Ch CALs gear Olea Haan i k -1 
iy ak seeeke, “1 SE hee che, “Kyo P 
where 
pl opst if kee 
4 £(4, 5+-e4,) = 2 f; and j,; A ges (22) p2P 
ky 9k 90+ ek tates 
p-l 


for (i,s = 1,...2P-4) 


. (2h) 




We are now in a position to prove the following fundamental theorem on which the general decoding 
principle of the class of codes under consideration rests. 


Theorem B: Each $g_{k_1 \dots k_r}$, or r-th degree coefficient, of any vector or polynomial f of the
r-th degree coding space in m variables satisfies exactly $2^{m-r}$ disjoint relations, where each
relation has precisely the form

$g_{k_1 \dots k_r} = \sum_{k=1}^{2^r} f_{i_k}$ ,

where $i_k$ are distinct numbers from the set $(0,1,2,\dots,2^m - 1)$ for $(k = 1,2,\dots,2^r)$.
Disjointness of relations means that no two relations have variables $f_j$ in common.


Proof:- Choose mand r. By (6),(7) and (2h), the highest degree cbefficients for an f of a 
are r) 


as 
x 
g = Beet (0,560) = Sf (25) 
eee ky, ie tte : iu) se 


where $k_j$ are distinct integers from the set $(1,2,\dots,m)$ for $(j = 1,\dots,r)$, and $j_i$ are
distinct integers from the set $(0,1,\dots,2^m - 1)$ for $(i = 1,2,\dots,2^r)$. Moreover,


A £(0,...0) = 0 (26) 
kK ees My oe oy, 


for t >1, and K; and n, are distinct integers from the set (152 )5 se sm) CLOF A jumed rena tye 


Let k, ,k, +e. k,, be a distinct set of integers from the set (1,2,.e-m). Then by (26) and (22), 


r+1 r Yr 
NeeetiO.e. 0) es A ££(050620)@), abu flO Noel's ee. 0)e=10 (27) 


kj ook k ++ ek, koe ek, 


where ny is any one of the m-r integers from the set (1,2,...m) which is distinct from the integers 
(k, > ky aeeek,)e Thus, by (2) and (25), we have exhibited m-r new relations of the form required by 


the theorem. Each of these new relations is distinguished by the fact that the digit one appears 
only in the n)-th position of the function f(i,,--e4,, ) operated on by 


r 
4 


k- e Kk. 
Now define f [ry snp seeeny | to be £(4, sinsee ed) with i, * 1 for k = Ty My p00 eT and i, = 0 
otherwise. The theorem will be proved by induction on the subscript of n. Assume therefore that 
rts-1 rz 
A £(0,0,...0) = Bb © £(0,0,.3.0) 


Ik, eeek, nN + +My) Ic, o00k,, 


x¢) 
@ A i [n,n oo eDs_ (28) 
ky o+0k,. ha 




Now, by (22) and (26) and the induction hypothesis (28), 


r+s r+s-1 
A £(OL0 i. = A A £(0,05 0240) 
k)-- +k n ---n, ny Ky eee 5-7 
r r 
s A A £(0,...0) ) A EA ees | 
ns ky oe ek, kj) ++ ek, 


x ( ) e ne [ ] 
= A Ae 0,0,...0 A if: Dh sec et oi. 
ky seek, ++ ek,, . 


Mere tI] ante, t[msernm] 0” 


Now, by (27) and (28), the two middle terms are equal to 


= 
A £(0,0,...0) 


ky oe ek, 


and therefore their sum modulo 2 is zero. Hence 


r+s = od r 
A). £(Oye. 30). = Ay ECON Ol ‘ £ leone = Obs; 


kj) ooen, kj + eek, Ky eee r 


and the induction is complete. The theorem is proved when we observe that the relation 


ms 
A £(0,...0) = 0 contributes ee iy distinct relations, 


ky oo en, 


Noe BORA DEE ed £[n,,n,---n, | ’ 
re 


ky + ek, Ky eee 


since there are $\binom{m-r}{s}$ ways of choosing s integers from m-r integers. Using all the
relations (26) for the particular set $k_1 \dots k_r$ and t = 1 to t = m-r, and the relation (25), we
get

$\sum_{s=0}^{m-r} \binom{m-r}{s} = 2^{m-r}$

distinct relations for $g_{k_1 \dots k_r}$. Since these relations exhaust all variables $f_j$, the
theorem is proved.




The above theorem shows that the generalization of the decoding principle discussed in the last
section holds. The majority decision test for the general case can clearly be used to extract the
r-th degree coefficients of f, where the relations used for the test are the $2^{m-r}$ relations of
Theorem B. The (r-1)-th degree coefficients are then extracted the same way after the determined
r-th order terms have been subtracted from, or added into, the received code. This process is
continued for the r-2, r-3, ... degree coefficients until the message is extracted or an
indeterminacy is reached.
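
The general procedure just described can be sketched in a few lines of Python. This is a hedged illustration rather than the paper's own formulation: the names `monomial`, `encode` and `decode` are hypothetical, coefficients are keyed by the tuple of variable indices of their term, $x_j$ is taken as the binary digit of weight $2^{j-1}$ of the position index as in (21), and any tied vote is reported as an indeterminacy.

```python
from itertools import combinations


def monomial(i, variables):
    """Value at digit i of the product of x_j over j in `variables`."""
    return int(all((i >> (j - 1)) & 1 for j in variables))


def encode(coeffs, m):
    """Code vector of length 2**m for a dict {variables tuple: 0 or 1}."""
    return [sum(g & monomial(i, v) for v, g in coeffs.items()) % 2
            for i in range(1 << m)]


def decode(received, m, r):
    """Extract coefficients degree by degree; None on an indeterminacy."""
    residual = list(received)
    coeffs = {}
    for degree in range(r, -1, -1):
        found = {}
        for variables in combinations(range(1, m + 1), degree):
            mask = sum(1 << (j - 1) for j in variables)
            votes = relations = 0
            for base in range(1 << m):
                if base & mask:
                    continue
                # One relation: the mod-2 sum of the 2**degree digits whose
                # indices agree with `base` outside the chosen variables.
                parity, sub = residual[base], mask
                while sub:
                    parity ^= residual[base | sub]
                    sub = (sub - 1) & mask
                votes += parity
                relations += 1
            if 2 * votes == relations:
                return None
            found[variables] = int(2 * votes > relations)
        # Remove the determined terms to leave the next residual code.
        for i in range(1 << m):
            for variables, g in found.items():
                residual[i] ^= g & monomial(i, variables)
        coeffs.update(found)
    return coeffs


message = {(): 1, (1,): 0, (2,): 1, (3,): 0, (4,): 1, (1, 2): 1,
           (1, 3): 0, (1, 4): 0, (2, 3): 1, (2, 4): 0, (3, 4): 1}
word = encode(message, 4)
word[6] ^= 1                      # one error in transmission
print(decode(word, 4, 2) == message)
```

Each coefficient of degree d is voted on by $2^{m-d}$ relations, so the lower-degree tests are progressively more redundant, in line with the remark in Section IV that the tests are overdeterminate.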


VI. Concluding Remarks 
Since there are $\binom{m}{j}$ j-th degree coefficients $g_{k_1 \dots k_j}$ in expansion (5), there
must be

$N = \sum_{i=0}^{r} \binom{m}{i}$        (29)

coefficients in each polynomial (7) of the coding space. The coefficients of (7) constitute the
message sent; thus each code of the coding space contains N bits of message information. Since each
element of the coding space is a vector of dimension $2^m$, there are $2^m - N$ bits of the code used
to supply redundancy.


In order to illustrate the relationship of the number of message bits to number of errors
corrected, consider the coding space with m = 7 and r = 4. By (29) each code of this space has 99
bits of message information for a code of 128 bits. By Section III at least

$2^{m-r-1} - 1 = 2^{7-4-1} - 1 = 3$

bits of error in the code can be corrected. By Section IV and Section V four bits of error will
undoubtedly lead to an indeterminacy in the message, and it is likely that in some cases of five
errors the correct message will be extracted by the majority decision process. Further examples of
the numerical relationship of message bits to number of errors corrected may be constructed in a
similar manner.
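
Further numerical examples of this kind follow directly from (29). The short Python sketch below tabulates them; the function names are illustrative, and the second function assumes r < m so that the exponent is non-negative.

```python
from math import comb


def message_bits(m, r):
    """N of (29): the number of coefficients of degree 0 through r."""
    return sum(comb(m, i) for i in range(r + 1))


def guaranteed_correctable(m, r):
    """Errors certainly corrected, 2**(m-r-1) - 1 (assumes r < m)."""
    return 2 ** (m - r - 1) - 1


m, r = 7, 4
n_bits = 2 ** m
print(n_bits, message_bits(m, r), n_bits - message_bits(m, r),
      guaranteed_correctable(m, r))
```

For m = 7 and r = 4 this gives a 128-bit code carrying 99 message bits and 29 redundant bits, with three errors certainly corrected, in agreement with the figures above.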


Attempts have been made with little success to investigate the structure of the complete
convex set S of points, containing an element s of the coding space, whose points correspond to the
element s under the majority decision test procedure of Section V. As the second example of Section
IV shows, there are in general more points in S than in a Hamming sphere of radius $2^{m-r-1} - 1$
containing s. These attempts were motivated by a desire to show that the coding system discussed
here would satisfy Shannon's fundamental theorem for a discrete channel with noise (Theorem 11 in
Ref. 3). So far, this fact has not been shown.


There are two generalizations of the codes discussed in this paper. In Ref. 2 Muller
discusses generalizations of the binary codes discussed here to lengths other than $2^m$. Another
generalization is possible where the polynomials considered here are taken over a field of
characteristic other than two; i.e., ternary codes, etc. It is not the purpose of this paper
to investigate these generalizations.


ACKNOWLEDGMENTS 


The author expresses his appreciation to E. B. Rawson for 
his assistance in the construction of the second example 
of Section 4; to G. P. Dinneen for his help in the simplification
of Theorem B; and to T. A. Kalin, W. B. Davenport,
D. E. Muller, and O. G. Selfridge for several useful discussions.


REFERENCES 


1. R. W. Hamming, Bell System Tech. J. 29, No. 2, 147 (April 1950).


2. D. E. Muller, "Metric Properties of Boolean Algebra and Their Application
to Switching Circuits," Report No. 46, Digital Computer Laboratory, Univ.
of Illinois (April 1953).


3. C. E. Shannon, "A Mathematical Theory of Communication," Bell System
Tech. J. 27 (July, October 1948).




CODING FOR CONSTANT-DATA-RATE SYSTEMS*†


Richard A. Silverman and Martin Balser 
Lincoln Laboratory, M.I.T. 
Lexington, Mass. 


A. INTRODUCTION 


We consider a communication system in which data consisting of sequences (known 
as words) of binary digits are transmitted at a predetermined constant rate. (For our purposes, 
a binary digit is one of two electrical signals of duration T and bandwidth W.) The nature of the 
data and the manner in which they are translated into words are irrelevant to this discussion. 

For example, the data may be English letters, numbers, etc., reduced to sequences of five 
binary digits each for use in a teletype system, or they may be conventional symbols represent- 
ing entire messages. 

A basic problem of coding is to reduce the average rate of incorrectly received words 
as much as possible. Accordingly, additional digits are added to the word for the purposes of 
error detection or correction. The assumption that the words are sent at a constant rate requires 
that each binary digit be shortened by such an amount that the coded words (message digits plus 
check digits) have the same duration as the original uncoded words. This shortening of each digit 
increases its probability of error and, consequently, the probability of error per word. On the 
other hand, the coding imposes constraints on the digits composing a word, so that errors may 
show up as inconsistencies and may in many cases be corrected. This tends to reduce the prob- 
ability of error per word. The efficacy of a code depends on how much the second effect out- 
weighs the first. 

The simplest code of all consists of an extra digit selected to make the sum of all the 
digits in the coded word even (or odd). If the sum of the digits of the received word has the wrong 
parity, an odd number of errors is known to have been made. There is, however, no indication 
of the correct replacement for the mistaken word, unless some dependence between separate 
words (such as the redundancy of printed English¹) is exploited.

Hamming² has devised a code that corrects all single errors. It consists of adding
k suitably chosen check digits to the m message digits. If another digit is added, double errors
can be detected as well as single errors corrected. In this paper, we describe a new single-
error-correcting code (the Wagner code) and evaluate its performance in a constant-data-rate
system, particularly as compared with that of the Hamming code.

The principle of the Wagner code is readily extended to the construction of multiple-
error-correcting codes. We evaluate the performance in a constant-data-rate system of two
such codes and of a code recently developed by I. S. Reed.


B. DESCRIPTION OF THE WAGNER CODE 


In this study, we are concerned with communication systems that transmit words consisting
of binary digits. A binary digit is one of two electrical signals $x_1(t)$ and $x_2(t)$ of duration
T and bandwidth W. Let $p(x_i/y)$ be the (a posteriori) probability that if y is received, $x_i$ was


*This paper is a condensation of two papers of this title, Part I, A New Error-Correcting Code, by R.A. Silverman
and M. Balser, and Part II, Multiple-Error-Correcting Codes, by M. Balser and R.A. Silverman, which will appear
in the Proceedings of the I.R.E. Further details and derivations which are omitted from this condensation are to be
found in the more complete papers.

†The research in this document was supported jointly by the Army, Navy and Air Force under contract with the
Massachusetts Institute of Technology.


sent, and let $\Delta p$ be $p(x_1/y) - p(x_2/y)$. In the absence of any constraints on the digits
composing a word or of dependence between the words themselves, the receiver can compute only
$p(x_1/y)$ and $p(x_2/y)$, and for each digit choose $x_1$ or $x_2$ depending on whether $p(x_1/y)$
or $p(x_2/y)$ is the larger (or, equivalently, whether $\Delta p$ is positive or negative). The
error-correcting code that we shall describe (named the Wagner code after C. A. Wagner of this
laboratory, who suggested the basic idea) enables us to use some of the information presented by the
magnitudes of the $\Delta p$'s* by introducing a constraint on the digits composing a word. This
information is ignored by more conventional codes.

In the Wagner code, a transmitted word consists of a sequence of m message digits 
and an additional digit used as a parity check. As each of the perturbed digits y arrives at the 
receiver, the a posteriori probabilities $p(x_1/y)$ and $p(x_2/y)$ are calculated. Each digit of the
received sequence is tentatively identified as $x_1$ or $x_2$, depending on whether $p(x_1/y)$ or $p(x_2/y)$ is
the larger, and the values of the a posteriori probabilities are stored in a memory for the dura- 
tion of a word. The sequence thus obtained is checked for parity. If the parity is correct, the 
word is printed as received. If the parity check fails, the digit for which the difference Ap be- 
tween a posteriori probabilities is the smallest is considered the digit most in doubt, and the 
word is printed with this digit altered. The receiver then clears the stored values of the proba- 
bility differences from the memory and proceeds to the next word. 
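
The receiver's procedure for one word can be sketched as follows. This Python fragment is an illustration only: the function name `wagner_decode` is hypothetical, the signal $x_1$ is assumed to stand for the binary digit 1, and the parity digit is assumed to make the sum of all digits in a word even.

```python
def wagner_decode(delta_p):
    """Decode one word from its stored probability differences.

    delta_p[i] is p(x1/y) - p(x2/y) for the i-th received digit,
    message digits and parity digit alike.
    """
    # Tentative identification: the sign of each difference picks the digit.
    digits = [1 if d > 0 else 0 for d in delta_p]
    if sum(digits) % 2 != 0:
        # Parity fails: alter the digit most in doubt, i.e. the one with
        # the smallest magnitude of probability difference.
        weakest = min(range(len(delta_p)), key=lambda i: abs(delta_p[i]))
        digits[weakest] ^= 1
    return digits


print(wagner_decode([0.9, -0.8, 0.1, -0.7]))    # parity holds, word unchanged
print(wagner_decode([0.9, -0.8, -0.1, -0.7]))   # parity fails, third digit altered
```

Only the ordering of the magnitudes matters for the correction step, so any monotonic measure of reliability could be stored in place of the probability differences.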

Thus we may characterize the Wagner code as one which probably corrects single 
errors. (Multiple errors are always printed incorrectly.) However, as we shall see, it can be 
more effective in a constant-data-rate system than a code that corrects all single errors (such 
as the Hamming code). 

The a posteriori probabilities $p(x_1/y)$ and $p(x_2/y)$ are functions of the random received
waveform y(t) and therefore are themselves random variables. The calculation of their distributions
is, in general, very difficult. For simplicity, we shall consider the case where the two
transmitted signals have equal energy and equal a priori probabilities and are perturbed by the
addition of white Gaussian noise. It has been shown³ that for this case

$p(x_1/y) = \beta \exp\left[\gamma \int_0^T x_1(t)\, y(t)\, dt\right]$        (1)

and

$p(x_2/y) = \beta \exp\left[\gamma \int_0^T x_2(t)\, y(t)\, dt\right]$ ,

where $\beta$ and $\gamma$ are constants. Thus the transmitted signal with the larger correlation has
the larger a posteriori probability. Equivalently, we may write

$\frac{p(x_1/y)}{p(x_2/y)} = \exp\left[\gamma (z_1 - z_2)\right]$ ,        (2)

where

$z_1 = \int_0^T x_1(t)\, y(t)\, dt$ and $z_2 = \int_0^T x_2(t)\, y(t)\, dt$ .        (3)


*The signs of the $\Delta p$'s are used in making the tentative identification of the transmitted word.




From Eq.(2), we see that the smaller the difference $\Delta z = z_1 - z_2$ between the correlation
integrals, the closer to unity the ratio of a posteriori probabilities and, consequently, the smaller
the difference $\Delta p$ between the two probabilities. Thus, if the parity check fails, the digit
that should be changed (as the one most in doubt) is the one for which $\Delta z$ is the smallest in
magnitude.

It can be shown that $z_1$ and $z_2$ are normally distributed random variables (with
means $c_1$ and $c_2$ and variances $\sigma_1^2$ and $\sigma_2^2$), so that calculations are
especially simple for the correlation detector. Moreover, under the assumptions made above, the
correlation detector is equivalent to the probability detector. Therefore, it is assumed in what
follows that detection is by correlation.

C. ANALYSIS OF THE WAGNER CODE 
1. Probability of Error Per Digit 


As noted above, the correlation integrals $z_1$ and $z_2$ (corresponding to the signal that
was sent and the signal that was not sent, respectively)* are random variables with probability
densities

$W(z_1) = \frac{1}{\sqrt{2\pi}\,\sigma_1} \exp\left[-\frac{(z_1 - c_1)^2}{2\sigma_1^2}\right]$        (4)

and

$W(z_2) = \frac{1}{\sqrt{2\pi}\,\sigma_2} \exp\left[-\frac{(z_2 - c_2)^2}{2\sigma_2^2}\right]$ .        (5)

If $x_1$ and $x_2$ are suitably chosen, $z_1$ and $z_2$ may be regarded as statistically independent
random variables. Accordingly, the probability density of finding a separation
$\Delta z = z_1 - z_2$ is

$W(\Delta z) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left[-\frac{(\Delta z - \Delta c)^2}{2\sigma^2}\right]$ ,        (6)

where $\Delta c = c_1 - c_2 > 0$ and $\sigma^2 = \sigma_1^2 + \sigma_2^2$. If $\Delta z$ is negative,
then selecting as the transmitted signal the signal giving the larger correlation integral will
result in an error. Thus the probability of error per digit is

$p(\alpha) = \int_{-\infty}^{0} W(\Delta z)\, d\Delta z = \tfrac{1}{2}(1 - \operatorname{erf} \alpha) = 1 - q(\alpha)$ ,        (7)
00 


where $\alpha = \Delta c/\sqrt{2}\,\sigma$. The parameter $\alpha$, which is proportional to the signal-to-noise ratio of the


*For simplicity of notation, we shall always use the subscript "one" for the signal that was transmitted. Thus
$c_1 > c_2$, and $z_1 < z_2$ results in an error.




correlator difference, is the significant parameter in the calculations that follow. 
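
Equation (7) is easily evaluated numerically. The following Python lines are a small sketch using the standard error function; the names `p_digit` and `q_digit` are illustrative stand-ins for $p(\alpha)$ and $q(\alpha)$.

```python
from math import erf


def p_digit(alpha):
    """Probability of error per digit, Eq. (7)."""
    return 0.5 * (1.0 - erf(alpha))


def q_digit(alpha):
    """Probability that a digit is received correctly."""
    return 1.0 - p_digit(alpha)


for alpha in (1.0, 1.5, 2.0, 3.0):
    print(alpha, p_digit(alpha), q_digit(alpha))
```

The digit error probability falls very rapidly with $\alpha$, which is why a modest shortening of each digit, and hence a modest reduction of $\alpha$, can noticeably raise the error rate.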

Since we do not know which digit was actually sent, we do not know the sign of $\Delta z$.
From Eq.(6), the joint probability that the correlator difference lies between $|\Delta z|$ and
$|\Delta z| + d|\Delta z|$, and that the larger correlation integral corresponds to the transmitted
signal, is $d|\Delta z|$ times

$W(|\Delta z|, \text{right}) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left[-\frac{(|\Delta z| - \Delta c)^2}{2\sigma^2}\right]$ .        (8)

On the other hand, the joint probability that the correlator difference lies between $|\Delta z|$ and
$|\Delta z| + d|\Delta z|$, and that the larger correlation integral does not correspond to the
transmitted signal, is $d|\Delta z|$ times

$W(|\Delta z|, \text{wrong}) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left[-\frac{(|\Delta z| + \Delta c)^2}{2\sigma^2}\right]$ .        (9)


2. Probability of Correcting a Single Error 




Suppose that the received word has n digits. Since the parity check fails if a single
error is made, and since then the Wagner code changes the digit with the smallest $|\Delta z|$, the
probability $\Pi_n(\alpha)$ that a single error is made and is corrected by the Wagner code is just
the probability that the digit with the smallest $|\Delta z|$ is incorrect and that the n - 1 other
digits are correct. $\Pi_n(\alpha)$ can be calculated as follows.

Let $|\Delta z_i|$ be the correlator difference for the i-th digit. Since the $|\Delta z_i|$ are
independent random variables, the joint probability that the first digit, with correlator difference
between $|\Delta z_1|$ and $|\Delta z_1| + d|\Delta z_1|$, is wrong, and that all the other digits,
with correlator differences between $|\Delta z_2|$ and $|\Delta z_2| + d|\Delta z_2|$, ...,
$|\Delta z_n|$ and $|\Delta z_n| + d|\Delta z_n|$, are right, is

$W(|\Delta z_1|, \text{wrong}) \prod_{i=2}^{n} W(|\Delta z_i|, \text{right}) \prod_{i=1}^{n} d|\Delta z_i|$ .        (10)

Thus the joint probability that the first digit is wrong, that the n - 1 other digits are correct,
and that $|\Delta z_1| < |\Delta z_2| < \dots < |\Delta z_n|$ is

$\int_0^{\infty} W(|\Delta z_n|, \text{right})\, d|\Delta z_n| \int_0^{|\Delta z_n|} W(|\Delta z_{n-1}|, \text{right})\, d|\Delta z_{n-1}| \cdots$
$\int_0^{|\Delta z_3|} W(|\Delta z_2|, \text{right})\, d|\Delta z_2| \int_0^{|\Delta z_2|} W(|\Delta z_1|, \text{wrong})\, d|\Delta z_1|$ .        (11)

Since there are in all n! orderings of the n correlator differences, the joint probability
$\Pi_n(\alpha)$ that a single error is made in a word of n digits and that it is corrected by the
Wagner code is given by

$\Pi_n(\alpha) = \frac{n!}{\pi^{n/2}} \int_0^{\infty} \exp[-(x_n - \alpha)^2]\, dx_n \int_0^{x_n} \exp[-(x_{n-1} - \alpha)^2]\, dx_{n-1} \cdots$
$\int_0^{x_3} \exp[-(x_2 - \alpha)^2]\, dx_2 \int_0^{x_2} \exp[-(x_1 + \alpha)^2]\, dx_1$ ,        (12)



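where $\alpha = \Delta c/\sqrt{2}\,\sigma$ as in Eq.(7). Equation (12) can be reduced to the following
form, more suitable for computations.

$\Pi_n(\alpha) = \frac{n}{2^{n-1}} \sum_{i=1}^{n} (-1)^{i-1} \binom{n-1}{i-1} I_i(\alpha)$ ,        (13)

where

$I_i(\alpha) = \frac{1}{\sqrt{\pi}} \int_0^{\infty} [\operatorname{erf}(x - \alpha)]^{i-1} \exp[-(x + \alpha)^2]\, dx$ .        (14)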


D. PROBABILITY OF ERROR FOR WAGNER-CODED WORDS — COMPARISON WITH 
UNCODED AND HAMMING-CODED WORDS 


The probability of error per word for a Wagner-coded word containing m message
digits (n = m + 1 digits in all) is

$P_W(\alpha) = 1 - q^{m+1}(\alpha) - \Pi_{m+1}(\alpha)$ ;        (15)

that is, the probability of error is one minus the sum of the probability that the word is received
correctly and the probability that a single error is made and then corrected. We wish to compare
$P_W$ with $P_U$, the probability of error per word if no code is used, and $P_H$, the probability of
error per word if the Hamming single-error-correcting code is used. Since we are concerned
with constant-data-rate systems, the duration of the transmitted signals must be altered if coded
words (message digits plus error-correcting digits) are to have the same duration as differently
coded or uncoded words. Changing the signal duration changes the variance of the correlator
difference and consequently the value of the parameter $\alpha$ and the probability of error per
digit [see Eq. (7)]. For large TW (the only case we consider), it can be shown that
$\alpha\ (= \Delta c/\sqrt{2}\,\sigma)$ is proportional to $T^{1/2}$. Using this result, we find that,
for the same value of $\alpha$ used in Eq. (15),

$P_U = 1 - q^{m}\left(\sqrt{\tfrac{m+1}{m}}\,\alpha\right)$        (16)

and

$P_H = 1 - q^{m+k}\left(\sqrt{\tfrac{m+1}{m+k}}\,\alpha\right) - (m+k)\, q^{m+k-1}\left(\sqrt{\tfrac{m+1}{m+k}}\,\alpha\right) p\left(\sqrt{\tfrac{m+1}{m+k}}\,\alpha\right)$ ,        (17)

where k is the number of check digits required by the Hamming code. $P_U$, $P_H$, and $P_W$ have
been computed for values of m from 4 to 8 and for selected values of $\alpha$. The results are given
in Table I.
There is only a certain range of values of the signal-to-noise ratio of the correlator
difference for which it is worth the effort to implement either of the error-correcting codes.
For high signal-to-noise ratio, very few errors are made, and additional equipment is generally
not justifiable. On the other hand, for low signal-to-noise ratio, multiple errors become too
frequent, and single-error-correcting codes are of little use. Thus single-error-correcting
codes are of considerable value for values of $\alpha$ from about 1.0 to 3.0.




TABLE I


Probabilities of error per word for uncoded, Hamming-coded,
and Wagner-coded words containing m message digits


m    $\alpha$    $P_U$ (uncoded)

4    1.0    0.209
     1.5    0.0349
     2.0    0.00313
     3.0    42 × 10^-7

5    1.0    0.269
     1.5    0.0493
     2.0    0.00486
     3.0    84 × 10^-7

6    1.0    0.325
     1.5    0.0641
     2.0    0.00673
     3.0    138 × 10^-7

7    1.0    0.377
     1.5    0.0789
     2.0    0.00871
     3.0    201 × 10^-7

8    1.0    0.425
     1.5    0.0937
     2.0    0.01075

3.0 ZT <One 


As shown in Table I, in this range of values the probability of error of Wagner-coded
words is considerably less than that of Hamming-coded words. For increasing word
length, the advantage of the Wagner code diminishes from two causes: (1) the ratio k/m decreases
with increasing m so that the length of the digits in the Hamming-coded word approaches
that of the digits in the Wagner-coded word, thus narrowing the gap between the corresponding
signal-to-noise ratios; and (2) the conditional probability of correcting a single error decreases
with increasing m. Nonetheless, even for m = 8 and $\alpha$ = 2, we may expect only 103 errors per
100,000 words using the Wagner code, as compared with 322 errors per 100,000 Hamming-coded words
and 1,075 errors per 100,000 uncoded words.


E. IDENTIFICATION OF THE SMALLEST CORRELATOR DIFFERENCE 


Until now we have assumed that the receiver can pick out the smallest correlator
difference with infinite precision. Suppose, however, that the equipment used in implementing
the Wagner code is such that the smaller of two correlator differences within $\epsilon\,\Delta c$ of
each other cannot be identified with certainty. We have found that even for a rather crude receiver,
which cannot distinguish correlator differences lying within 0.1 $\Delta c$ of each other, the
percentage change in the probability of error per word is at most about two per cent in the region
of interest. Thus the advantage of the Wagner code over the Hamming code does not depend on great
precision of the correlators or the memory.




F. THE HAMMING-WAGNER CODE 

We now extend the principle of the Wagner code to a double-error-correcting code.
The following procedure appears best as a first attempt. Further check digits are added to the
Wagner-coded word; these reveal double as well as single errors. If a double error is detected,
we change the two digits of the stored word with the smallest correlator differences. If a single
error is detected, we change only the digit with the smallest correlator difference.

The success of this scheme requires a system of check digits which indicates both
single and double errors, and further allows them to be distinguished. The geometrical model
of message space is well suited for examining the possibility of setting up such check digits.*
Referring to Fig.1, we see that if both single and double errors in possible transmitted points
(such as $P_1$ and $P_2$) are to be detectable, and if single errors are to be distinguishable from
double errors, every such pair of points must be separated by a distance of 4 or more. For
then, a single error in $P_1$ sends it to a neighboring point like $S_1$, where it can be stated with
certainty to have come either from $P_1$ by a change in one digit, or from some other possible
transmitted message by a change in three or more digits. Similarly, a single error in $P_2$ sends
it to a neighbor like $S_2$. On the other hand, a double error in either $P_1$ or $P_2$ may correspond
to a received point like D, at a distance of 2 from both. Unless there are at least three points
between all pairs of possible transmitted points, a double error in $P_1$ (say) is indistinguishable
from a single error in $P_2$ (or some other transmitted point), so that we do not know whether to
correct one or two digits in the received word.

Now in a Hamming single-error-correcting, double-error-detecting code,² all transmitted
messages are separated by at least a distance of 4. This is just the separation required
for successful operation of a Wagner code that corrects both single and double errors. Thus 
the number of check digits needed to correct all single errors before applying the Wagner pro- 
cedure to double errors is the same as the number required to apply the Wagner procedure to 
both single and double errors. This suggests a "Hamming-Wagner" code, which is obviously 
better than the corresponding "Wagner-Wagner" code. 

We thus arrive at a code that is like the Hamming single-error-correcting, double- 
error-detecting code, except that if the extra check digit indicates a double error, we change 
the two digits with the smallest correlator differences. The analysis of this Hamming-Wagner 
code is completely analogous to that of the simple Wagner code. 
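
The decoding rule just described can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it uses the 8-digit extended Hamming code as the example code, takes a positive correlator difference to mean the digit 1, and uses hypothetical function names.

```python
def encode(data):
    """Encode 4 message digits into an 8-digit extended Hamming word."""
    d1, d2, d3, d4 = data
    word = [d1 ^ d2 ^ d4, d1 ^ d3 ^ d4, d1, d2 ^ d3 ^ d4, d2, d3, d4]
    return word + [sum(word) % 2]  # extra (overall parity) check digit


def hamming_wagner_decode(x):
    """Decode 8 correlator differences x (sign = digit, size = reliability)."""
    bits = [1 if v > 0 else 0 for v in x]
    # Hamming syndrome: XOR of the 1-based positions (1..7) that hold a 1.
    syndrome = 0
    for pos in range(1, 8):
        if bits[pos - 1]:
            syndrome ^= pos
    parity_fails = sum(bits) % 2 == 1
    if parity_fails:
        # Odd number of errors: treat as a single error and correct it.
        # A zero syndrome here means the extra check digit itself is wrong.
        bits[syndrome - 1 if syndrome else 7] ^= 1
    elif syndrome:
        # Parity holds but the syndrome is nonzero: double error indicated.
        # Wagner step: change the two digits with the smallest |x|.
        for i in sorted(range(8), key=lambda i: abs(x[i]))[:2]:
            bits[i] ^= 1
    return bits


sent = encode([1, 0, 1, 1])
x = [1.0 if b else -1.0 for b in sent]
x[2] *= -0.1   # two digits received in error, both with small magnitude
x[5] *= -0.2
print(hamming_wagner_decode(x) == sent)
```

The double-error branch succeeds only when the two erroneous digits are also the two least reliable ones, which is the event whose probability enters the analysis that follows.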


The probability of error per Hamming-Wagner-coded word is

$$P_{HW}(a) = 1 - q^{m+k+1}(a) - (m+k+1)\, q^{m+k}(a)\, p(a) - \nu_{m+k+1}(a) \qquad (18)$$

where a, p(a), and q(a) have already been defined. The quantity k is the number of check digits
required by the Hamming single-error-correcting code. The quantity $\nu_n(a)$ [in analogy to
Eq. (12)] is the multiple integral


*The set of possible sequences of n binary digits can be represented by the vertices of a unit cube in a space of
n dimensions.⁴ The distance between two vertices is defined as the number of binary digits in which the
corresponding sequences differ.




Fig. 1. Configuration of points in message space
between two possible transmitted messages $P_1$ and $P_2$.


TABLE II


COMPARISON OF HAMMING, WAGNER, AND HAMMING-WAGNER CODES


(b) a = 1.80

[Only two unlabelled columns of error probabilities from panel (b) are legible; their headings and
the corresponding values of m are lost.]

0.0029, 0.0033, 0.0037, 0.0042, 0.0046, 0.0051, 0.0057, 0.0062, 0.0068, 0.0074, 0.0080, 0.0086

0.0015, 0.0019, 0.0024, 0.0030, 0.0037, 0.0044, 0.0052, 0.0061, 0.0070, 0.0080, 0.0091


Values of a are for the Hamming-Wagner code
m = number of message digits



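$$\nu_n(a) = [\ldots] \int \exp[-(x_n - a)^2]\,dx_n \int \exp[-(x_{n-1} - a)^2]\,dx_{n-1} \cdots \int \exp[-(x_3 - a)^2]\,dx_3 \int \exp[-(x_2 + a)^2]\,dx_2 \int \exp[-(x_1 + a)^2]\,dx_1 \qquad (19)$$

(The leading coefficient and the limits of integration are illegible in the source.)


Equation (19) can be reduced to the sum


[Eq. (20), only partly legible in the source: $\nu_n(a)$ is expressed as $n(n-1)$ times a finite sum of
integrals of the form $\int [\operatorname{erf}(x - a)]^{i} \exp[-(x + a)^2]\,dx$.] (20)


in complete analogy to Eq. (13).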


$P_{HW}(1.35)$ and $P_{HW}(1.80)$ are tabulated in Table II for various values of m, together
with the corresponding probabilities of error for uncoded, Hamming-coded, and Wagner-coded
words. The values of a used in computing $P_U$, $P_H$, and $P_W$ are chosen so that all words (message
digits plus check digits) have the same duration, as required in a constant-data-rate system.


Thus


$$P_U(a_U) = 1 - q^{m}(a_U), \qquad a_U = \sqrt{\frac{m+k+1}{m}}\; a_{HW}$$

$$P_H(a_H) = 1 - q^{m+k}(a_H) - (m+k)\, q^{m+k-1}(a_H)\, p(a_H), \qquad a_H = \sqrt{\frac{m+k+1}{m+k}}\; a_{HW} \qquad (21)$$

$$P_W(a_W) = 1 - q^{m+1}(a_W) - \mu_{m+1}(a_W), \qquad a_W = \sqrt{\frac{m+k+1}{m+1}}\; a_{HW}$$
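
The constant-data-rate scaling in Eq. (21) can be illustrated numerically. The sketch below rests on two assumptions that are not stated in this section: the digit error probability is taken as p(a) = (1/2) erfc(a) with q(a) = 1 - p(a), and k is taken as the smallest integer satisfying 2^k >= m + k + 1. The function names are hypothetical.

```python
import math


def p(a):
    """Digit error probability; assumed form p(a) = (1/2) erfc(a)."""
    return 0.5 * math.erfc(a)


def hamming_check_digits(m):
    """Smallest k with 2**k >= m + k + 1 (single-error-correcting code)."""
    k = 1
    while 2 ** k < m + k + 1:
        k += 1
    return k


def uncoded_and_hamming(m, a_hw):
    """P_U and P_H of Eq. (21), given a for the Hamming-Wagner word."""
    k = hamming_check_digits(m)
    n_hw = m + k + 1                       # digits in the Hamming-Wagner word
    a_u = math.sqrt(n_hw / m) * a_hw       # fewer digits, so longer digits
    a_h = math.sqrt(n_hw / (m + k)) * a_hw
    p_u = 1 - (1 - p(a_u)) ** m
    q_h = 1 - p(a_h)
    p_h = 1 - q_h ** (m + k) - (m + k) * q_h ** (m + k - 1) * p(a_h)
    return p_u, p_h


# k changes from 4 to 5 between m = 11 and m = 12.
print(hamming_check_digits(11), hamming_check_digits(12))

p_u, p_h = uncoded_and_hamming(16, 1.80)
print(round(p_u, 4), round(p_h, 4))
```

Under these assumptions the uncoded value for m = 16 and a = 1.80 comes out close to the corresponding uncoded entry in Table V, which suggests the assumed form of p(a) is at least consistent with the tabulated results.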


Table II shows that for a = 1.35, a very noisy case, the Hamming code becomes better
than the Wagner code at m = 21. For a = 1.80, which corresponds to much less noise, the
Hamming code surpasses the Wagner code at m = 24. Thus it appears that, starting with some
value of m between 25 and 30, the Hamming code is better than the Wagner code anywhere in the
significant range of a (neither too little nor too much noise). This happens for the reasons given
at the end of Sec. D.

We see from the table that the Hamming-Wagner code is consistently better than the 
Hamming code; however, the percentage improvement is greater for a = 1.80 than for the noisier 
case a = 1.35. For a = 1.35, the Hamming-Wagner code is better than the Wagner code for all
m > 17; for a = 1.80, the Hamming-Wagner code is better than the Wagner code for all m > 13.*
Thus, while the Wagner code is superior to the Hamming code for words of length less than about 
20, the Hamming-Wagner code is superior to either of these codes for words of length greater 
than about 15.** The Hamming-Wagner code works better in low noise than in high noise, because 
(1) proportionately fewer multiple errors are of order higher than two, and (2) the conditional 
probability of correcting double errors is higher. Since this conditional probability decreases 
as m increases, the Hamming-Wagner code gradually becomes less effective, as shown in the 
next section. 


*The Hamming-Wagner code is also better for m = 10 and 11. This anomaly is due to the change in k from 4 to
5 at m = 12.
**The value of m for which one code becomes better than another is somewhat dependent on a. (See Table II.)




G. THE SYLLABIFIED WAGNER CODE 


Another multiple-error-correcting code based on the principle of the Wagner code is 
the syllabified Wagner code, constructed by dividing each word into separately Wagner-coded 
subwords or syllables. Suppose a word with m message digits is divided into j syllables, each
containing $n_i = m_i + 1$ digits, where

$$\sum_{i=1}^{j} m_i = m.$$

Since the probability that a syllable (regarded as a Wagner-coded word) is correct is

$$q^{n_i}(a) + \mu_{n_i}(a),$$

the probability of error for a syllabified-Wagner-coded word is

$$P_{SW}(m_1, m_2, \ldots, m_j) = 1 - \prod_{i=1}^{j} \left[ q^{n_i}(a) + \mu_{n_i}(a) \right]. \qquad (22)$$

It follows at once that for a given number of syllables $P_{SW}(m_1, m_2, \ldots, m_j)$ is smallest when
the syllables have equal length (or as nearly equal as possible).
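
Equation (22) can be explored numerically with the sketch below. It is an illustration under assumptions that are not taken from this section: the digit error probability is p(a) = (1/2) erfc(a), and the probability that a syllable of n digits contains a single error which the Wagner rule corrects is written as n times the integral from 0 to infinity of f(x) G(x)^(n-1), with f(x) = exp[-(x+a)^2]/sqrt(pi) the density of the erroneous digit's magnitude and G(x) = [1 - erf(x-a)]/2 the probability that a correct digit exceeds x. The function names are hypothetical.

```python
import math


def p(a):
    """Digit error probability; assumed form p(a) = (1/2) erfc(a)."""
    return 0.5 * math.erfc(a)


def mu(n, a, upper=10.0, steps=4000):
    """Probability of a single error that is the least reliable of n digits.

    Composite Simpson rule on [0, upper]; the integrand is negligible
    beyond the upper limit for the values of a considered here.
    """
    def integrand(x):
        f = math.exp(-(x + a) ** 2) / math.sqrt(math.pi)
        g = 0.5 * (1.0 - math.erf(x - a))
        return f * g ** (n - 1)

    h = upper / steps
    total = integrand(0.0) + integrand(upper)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * integrand(i * h)
    return n * total * h / 3.0


def p_sw(message_digits, a):
    """Eq. (22) for syllables holding the given numbers of message digits."""
    q = 1.0 - p(a)
    correct = 1.0
    for m_i in message_digits:
        n_i = m_i + 1                      # one check digit per syllable
        correct *= q ** n_i + mu(n_i, a)
    return 1.0 - correct


# Same total of 18 message digits, split equally and unequally.
print(p_sw([9, 9], 1.5), p_sw([4, 14], 1.5))
```

For this example the equal split gives the smaller probability of error, in line with the statement above; other values of a and m can be tried in the same way.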


If too few syllables are used, the conditional probability of correction of single errors 
per syllable is small because the syllables are too long. If too many syllables are used, this 
conditional probability is small because the large number of check digits leads to a small value 
of a. (This second effect is partially compensated by increased multiple-error-correction pos- 
sibilities.) The optimum number of syllables is a compromise between these two effects. This 
optimum number is not necessarily critical, or for that matter the same for all a. The simple
Wagner code (which may be considered a syllabified Wagner code of one syllable) is clearly best
for short words. At about m = 14, division into two syllables is better than the simple Wagner
code. At m = 30, divisions into three and four syllables are about equally effective, and better
than divisions into more or fewer syllables. A syllable length of seven to ten digits seems to be
best.
All these points are illustrated in Table III, which compares $P_{HW}$ and $P_{SW}$ for several
values of m and $a_{HW}$ = 1.80. The table also shows how the syllabified Wagner code finally
surpasses the Hamming-Wagner code at about m = 80. As previously mentioned, this is due to the
decrease in $C_{HW}$, the conditional probability that the Hamming-Wagner code corrects double
errors, as m increases. This decrease in $C_{HW}$ is also shown in Table III. The formulas used
for calculating the P's are the same as those in Eqs. (21) and (22) with the a's related by

$$a_U = \sqrt{\frac{m+k+1}{m}}\; a_{HW}, \qquad a_W = \sqrt{\frac{m+k+1}{m+1}}\; a_{HW}, \qquad a_{SW} = \sqrt{\frac{m+k+1}{m+j}}\; a_{HW} \qquad (23)$$

where m is the number of message digits, k the number of check digits, and j the number of
syllables.




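TABLE III


COMPARISON OF THE HAMMING-WAGNER AND SYLLABIFIED WAGNER CODES

[The row and column alignment of this table, including the values of m, is lost. The legible
entries are listed below in the order in which they appear.]

Entries under the heading $P_{SW}$:
0.00107, 0.0081, 0.0087, 0.00148, 0.00144, 0.00228, 0.00322, 0.00342, 0.00438, 0.00449, 0.00577,
0.00570, 0.00700, 0.00735, 0.00981, 0.00975, 0.01006, 0.0200, 0.0201, 0.0318, 0.0317, 0.0468, 0.0659

Entries in the following column (with the adjacent entry where one is legible):
0.00143, 0.00186, 0.00236, 0.00292, 0.00356, 0.00426, 0.00730, 0.0146 (45.5), 0.0244 (70),
0.0448 (0.43), 0.0688 (0.38)

Entries in the column for j (? marks an illegible entry):
1, 2, 1, 2, 2, 2, 3, 2, 3, ?, 3, 3, 4, 3, 4, 5, 5, 6, 6, ?, 8, ?

j = number of syllables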


H. THE REED CODE 


We now examine the performance in a constant-data-rate system of the Reed code,³
an algebraic multiple-error-correcting code. The Reed code is applicable only when the total
number of digits in a word is a power of 2. Corresponding to each possible word length, there
are only certain possible values of the order to which errors may be corrected. For each of
these possible values, the number of message digits is determined. This feature limits the
application of the code in communication systems, for the number of message digits in a word
(fixed by other considerations) may not correspond to a possible choice in a Reed code.
Complete details are given in Reed's paper.
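
The restriction on word lengths can be made concrete with a short sketch. It assumes the standard Reed-Muller parameters, which are not spelled out in this section: for a word of 2^r digits and a code of order s, the number of message digits is the sum of the binomial coefficients C(r, i) for i = 0 to s, and all errors up to order 2^(r-s-1) - 1 are corrected. The function name is hypothetical.

```python
from math import comb


def reed_parameters(r, order):
    """(total digits, message digits, errors corrected) for one Reed code."""
    total = 2 ** r
    message = sum(comb(r, i) for i in range(order + 1))
    corrected = 2 ** (r - order - 1) - 1
    return total, message, corrected


# Three-error-correcting cases: order = r - 3.
for r in range(5, 9):
    print(reed_parameters(r, r - 3))
```

Under this assumption the three-error-correcting codes with 32, 64, 128, and 256 digits carry 16, 42, 99, and 219 message digits, which are the values of m that appear in Table IV.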

Table IV shows corresponding probabilities of error per word for a Reed three-error-correcting
code, the Hamming single-error-correcting code, and no code at all. The formulas
for $P_U$ and $P_H$ are the same as those given in Eq. (21); the probability $P_R$ is given by

$$P_R(a_R) = 1 - \sum_{i=0}^{3} \binom{m+k_R}{i}\, q^{m+k_R-i}(a_R)\, p^{i}(a_R) \qquad (24)$$

where $k_R$ is the number of Reed check digits, $k_H$ the number of Hamming check digits required
for m message digits. The relations required to find the a's used in Table IV are

$$a_U = \sqrt{\frac{m+k_H}{m}}\; a_H, \qquad a_R = \sqrt{\frac{m+k_H}{m+k_R}}\; a_H \qquad (25)$$


We see from Table IV that the Reed code outperforms the Hamming code, even for 
m=16. Thus the decrease in a produced by the extra check digits of the Reed code is more than 
compensated by the ability to correct all double and triple errors. The advantage is less marked 
for smaller a, since in high noise many more errors of order greater than three are introduced 
by the shortening of the digit length.

In Table V, the Reed code is compared for three of its allowed values of m with the 
best of the Wagner codes. The probability of error for uncoded words is given for reference.

We see that the Wagner-type codes can compete with the Reed code in high noise. 

As the noise decreases or m increases, the Reed code increases its advantage. It is clear that 
for ordinary communication purposes, the Reed code would be better for long words than any of 


the previously considered codes if the restriction on the allowed values of m could be removed. 


TABLE IV


PROBABILITIES OF ERROR FOR UNCODED, HAMMING-CODED,
AND REED-CODED WORDS


  m      a_H     P_U        P_H        P_R

  16     1.5     0.114      0.049      0.047
  16     2.0     0.00953    0.00112    0.00042
  42     1.5     0.389      0.195      0.163
  42     2.0     0.0512     0.0057     0.0011
  99     1.5     0.754      0.538      0.449
  99     2.0     0.1558     0.0259     0.0041
  219    1.5     0.967      0.899      0.839
  219    2.0     0.354      0.100      0.018
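
A row of Table IV can be recomputed from Eqs. (21), (24), and (25) as a rough check. The sketch assumes p(a) = (1/2) erfc(a) for the digit error probability (defined outside this section), and it takes the check-digit counts for m = 16 as 5 for the Hamming code and 16 for the Reed code (a 32-digit word). The function name and argument names are hypothetical.

```python
import math
from math import comb


def p(a):
    """Digit error probability; assumed form p(a) = (1/2) erfc(a)."""
    return 0.5 * math.erfc(a)


def table_iv_row(m, k_h, k_r, a_h, order=3):
    """Return (P_U, P_H, P_R) for m message digits and Hamming parameter a_h."""
    n_h, n_r = m + k_h, m + k_r
    a_u = math.sqrt(n_h / m) * a_h         # Eq. (25), uncoded word
    a_r = math.sqrt(n_h / n_r) * a_h       # Eq. (25), Reed-coded word

    p_u = 1 - (1 - p(a_u)) ** m

    ph, qh = p(a_h), 1 - p(a_h)
    p_h = 1 - qh ** n_h - n_h * qh ** (n_h - 1) * ph

    pr, qr = p(a_r), 1 - p(a_r)
    # Eq. (24): the word is correct if at most `order` digits are in error.
    p_r = 1 - sum(comb(n_r, i) * qr ** (n_r - i) * pr ** i
                  for i in range(order + 1))
    return p_u, p_h, p_r


print([round(v, 3) for v in table_iv_row(16, 5, 16, 1.5)])
```

Under these assumptions the three values should land close to the first row of the table; any remaining differences would reflect rounding in the original hand computation or in the assumed form of p(a).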




TABLE V


COMPARISON OF REED CODE WITH HAMMING-WAGNER
AND SYLLABIFIED WAGNER CODES


  m     a_HW           P_U       P_HW      P_R
  16    1.80           0.0224    0.0019    0.0022
  42    1.80           0.1181    0.0146    0.0097

  m     a_SW    j      P_U       P_SW      P_R
  99    1.50    10     0.726     0.359     0.403
  99    2.00    10     0.1379    0.0151    0.0029


I. SUMMARY AND CONCLUSIONS 


We have considered the use of several types of binary codes in communication systems,
making the following assumptions:


(1) The system transmits sequences of binary digits known as words. If any 
digit is altered, the information carried by a word is lost. Thus, by definition, se- 
quences obtained by combining words are not themselves words. 


(2) The transmitted digits are one of two electrical signals of bandwidth W and 
duration T. They have equal energies and equal a priori probabilities. 


(3) The entire coded word must be transmitted in a given time, regardless of 
the number of code digits required to check the message digits. (Assumption of con- 
stant data-rate. ) 


(4) The transmitted digits are corrupted by the addition of white Gaussian noise. 
They are determined by choosing the larger of two independent and normally distributed 
correlator outputs. The time-bandwidth product, TW, of the transmitted signals is >> 1, 
so that when the signal length is changed to accommodate different numbers of check 
digits, the signal-to-noise ratio of the correlator difference voltage is proportional to 
the square root of the signal length.? (Actually, the signal-to-noise ratio is proportional 
to $\sqrt{TW}$, but we assume that W is not changed, an assumption that requires TW >> 1.)


By the best code (of those we consider) for a given word length and channel noise, we 
mean that for which the probability of error per word is smallest (under the assumption of con- 
stant data-rate). We have considered the following systematic codes: — (1) the Hamming single- 
error-correcting code, (2) the Wagner code, (3) the Hamming-Wagner code, (4) the syllabified 
Wagner code, and (5) the Reed multiple-error-correcting codes. (The Wagner, Hamming-Wagner, 
and syllabified Wagner codes are introduced in this paper.) For short words (m < about 15) we 
find that the Wagner code is best in the range of interest (neither too little nor too much noise). 

As m increases, the Wagner code is surpassed by both the Hamming-Wagner code and a syllabified
Wagner code of two syllables.* For values of m < about 80, all syllabified Wagner codes are


*The Hamming code surpasses the Wagner code for m about 20, but is always inferior to the
Hamming-Wagner code.




inferior to the Hamming-Wagner code. For larger m, the conditional probability that double 
errors are corrected by the Hamming-Wagner code has fallen sufficiently so that a syllabified 
Wagner code is better. Thus, were it not for the Reed code (which is only applicable for a few 
word lengths), we could say that the Wagner code is best for short words, the Hamming-Wagner
code for medium length and long words, and the syllabified Wagner code for very long words. 
However, the Reed code outperforms the Hamming-Wagner code at m = 42 and the syllabified 
Wagner code at m = 99 (except in excessively high noise), showing that for large m there is no 
substitute for an algebraic multiple-error-correcting code. We can safely say that if the Reed 
code can be generalized to apply to any number of message digits, it will be the best code except 
for short words. This assumes that the proportion of check to message digits turns out to be 
comparable to that of the present Reed code. 


The numerical work reported here was done by Mrs. Elizabeth Munro. 


REFERENCES 


1. C. E. Shannon, Bell System Tech. Jour. 30, 50-64 (January 1951).

2. R. W. Hamming, Bell System Tech. Jour. 29, 147-160 (April 1950).

3. I. S. Reed, "A Class of Multiple-Error-Correcting Codes and the
Decoding Scheme," Technical Report No. 44, Lincoln Laboratory,
M.I.T. (9 October 1953).

4. P. M. Woodward and I. L. Davies, Proc. I.E.E. 99, III, 37 (March 1952).

5. W. B. Davenport, Jr., R. A. Johnson, and D. Middleton, Jour. Appl. Phys. 23,
4, 377-388 (April 1952).




INFORMATION, ORGANIZATION AND SYSTEMS 


Jerome Rothstein 
Columbia University 
New York, N.Y. 


The object of this paper is to develop and apply a mathematical concept of organization and of 
systems. It is very closely related to the information concept and provides the link whereby the 
theorems of communication theory become generalized and applicable to systems in general. Brief ap- 
plications are given to system reliability, the significance of organization theory for circuit de- 
sign, and production and quality control for a systems viewpoint. 


The Nature of Organization 


Intuitively, one equates disorganization to chaos. This suggests the possibility of measuring 
quantity of organization by the amount of information required to specify an organization in terms of 
its unorganized components, or by the entropy increase occurring when the organization is dissolved. 
The two approaches coincide, and organization can be regarded as a generalization of the entropy con- 
cept, just as information is. 

Consider a set of elements, each associated with a set of alternatives. It is unnecessary to 
specify the nature of elements or associated alternatives for the mathematical theory, but it may be 
helpful to keep concrete examples in mind, e.g., a message source or a physical system as the element, 
and the ensemble of possible messages or the set of operationally distinguishable states of the sys- 
tem respectively as the associated sets of alternatives. The elements need not be of similar natures, 
nor need the various alternatives associated with a given element have more in common than that 
association. 

Call each set of alternatives associated with an element a space. Restrict these spaces to the 
class of measure spaces on which a probability measure and thus entropies can be defined for subsets 
of each one. Consider the measure space formed by taking the Cartesian or direct product of all these 
spaces. A "point" in the product space can be looked at as a "vector," each component being a point 
in the space of some element. Probabilities and entropies can be defined for subsets of the product 
space. The set of elements will be said to be unorganized or have zero organization (synonyms: 
mutually independent, statistically independent, uncoupled, not in communication with each other, un- 
connected, etc.) if the entropy of a set of points in the product space is the sum of the entropies 
of the corresponding sets of points in each of the spaces associated with an element. This is clearly 
the case of maximum set entropy in product space for specified sets in each element space. 

The essence of organizing a set of elements resides precisely in the fact that elements do influ- 
ence each other. Synonyms for organized are: coupled, linked, correlated, connected, in communication 
with each other, coordinated, interacting, etc. It is in these couplings, correlations, or constraints 
that the organization consists. A measure of amount of organization entailed by the couplings is the 
concomitant reduction of entropy in product space compared to that calculated for the unorganized 
state of the same elements. Organization, like information, is thus a negative entropy. It is
maximal for perfect correlation between elements, i.e., functional rather than stochastic relations
between them. Consider the case of two elements (generalization to a finite number is simple), each of
whose spaces can be represented by a line segment over which a probability distribution is given. 

The product space is then a rectangle with the segments as sides. For zero organization, assuming 
probability densities are defined on the segments, a probability density is defined over the rectangle 
and is simply the product of the segmental densities. For maximal organization, the domain of non- 
vanishing values of the distribution in product space shrinks to a set of zero planar measure, e.g., 
a curve. The segment distributions and this one do not differ essentially, as each is a one to one 
map of the others. In general, the segment distributions are marginal distributions corresponding to 
integrating out one of the two variates. For organization between zero and maximal, the probability 
density in the rectangle is peaked in some regions and lowered in others compared to the product of 
the segmental densities, with maximal organization a kind of delta function limit of this. More
explicitly, if the probability density in the rectangle is p(x,y), the marginal distributions are

$$p(x) = \int p(x,y)\,dy \qquad (1)$$

$$p(y) = \int p(x,y)\,dx \qquad (2)$$

with the corresponding entropies (using Shannon's notation) given by

$$H(x,y) = -\iint p(x,y) \log p(x,y)\,dx\,dy \qquad (3)$$

$$H(x) = -\int p(x) \log p(x)\,dx \qquad (4)$$

$$H(y) = -\int p(y) \log p(y)\,dy \qquad (5)$$

and the amount of organization, $\Delta H$, is

$$\Delta H = H(x) + H(y) - H(x,y). \qquad (6)$$




In terms of conditional entropies, $H_x(y)$, $H_y(x)$, and the known relations

$$H(x,y) = H(x) + H_x(y) = H(y) + H_y(x), \qquad (7)$$

this can be rewritten in the forms

$$\Delta H = H(x,y) - H_x(y) - H_y(x) = H(y) - H_x(y) = H(x) - H_y(x). \qquad (8)$$

From any of these expressions for $\Delta H$, it follows immediately that $\Delta H$ vanishes for statistical
independence of the two variates, and is a maximum equal to either univariate entropy, for functional
connection between them.
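
The two limiting cases can be checked with a small numerical sketch. The paper works with probability densities on line segments; the discrete two-by-two distributions used here, the base-2 logarithm, and the function names are assumptions made only for illustration.

```python
import math


def entropy(probs):
    """Entropy in bits of a discrete distribution (zero terms skipped)."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def organization(joint):
    """Delta H = H(x) + H(y) - H(x,y) for a joint distribution p(x,y)."""
    px = [sum(row) for row in joint]            # marginal over y
    py = [sum(col) for col in zip(*joint)]      # marginal over x
    hxy = entropy([p for row in joint for p in row])
    return entropy(px) + entropy(py) - hxy


independent = [[0.25, 0.25], [0.25, 0.25]]   # p(x,y) = p(x) p(y)
functional = [[0.5, 0.0], [0.0, 0.5]]        # y determined by x

print(organization(independent), organization(functional))
```

The independent case gives zero organization, and the functional case gives an amount equal to the entropy of either variate alone (one bit here), matching the two limits stated above.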


Systems 


Intimately connected with the concept of organization is that of a system, defined as an
organization with a function, i.e., it couples two (or more) ensembles of interest. Synonyms for
function are: task, program, behavior pattern, stimulus-response, input-output, etc. System plus
function constitute an organization as defined earlier. Reasons for introducing a separate definition
are practical, e.g., (a) the function is often specified in advance and not under the control of the
system designer, whereas (b) design of a system to realize the given function is the center of
interest and under the control of the system designer, (c) simplifications often result if part of a
complex organization whose internal interactions are much stronger than its interactions with other
parts is treated as an entity in itself, (d) dealing with system and function often requires entirely
different techniques and one can often be treated almost independently of the other, (e) except for
requirements that the system be optimum in some sense or senses, e.g., economical, dependable, etc.,
the particular elements and associated ensembles of alternatives are not of interest in themselves
but only as means to an end, viz. performance of the desired function. The function of the system can
be simply described as organizing a product space, e.g., the product of input and output spaces. A
communication system is a system in this sense, its function being to couple an ensemble of messages
at a source to an ensemble of messages at a destination. Similarly, a physical theory is a system for
predicting: it couples states at a future time (output), say, with initial states (input).
Alternatively expressed, theory organizes observation. Other examples from many are strategies, which
organize one's play (cut down the ensemble of possible moves in a given position; input is the
position before moving, output is the position after moving), manufacturing systems (input of raw
materials, output of finished items, manufacturing tolerances correspond to permissible noise level,
etc.), transportation systems, automatic control systems, etc.

Consider a system with input space (x), output space (y), marginal entropies H(x), H(y), and joint
entropy H(x,y). The amount of organization introduced by the system is clearly given by (6) or (8). The
same equations hold if entropy is replaced by entropy rate. For the special case of (x), the message 
space at a source, and (y), the message space at a destination, these are familiar equations for rate 
of transmission of information. In general, they give the rate (usually denoted by R) at which the 
system organizes the product space. The maximum rate for all input ensembles is called the system 
capacity and can be expressed as 


$$C = \lim_{T \to \infty} \max_{p(x)} \frac{1}{T} \iint p(x,y) \log \frac{p(x,y)}{p(x)\,p(y)}\,dx\,dy \qquad (9)$$


C is the channel capacity for a communication system. 
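
The maximization in Eq. (9) can be illustrated for a simple discrete case. The sketch below is an assumption-laden simplification: it treats a single use of a binary symmetric channel, so the limit over T is omitted, the integrals become sums, and the maximum over input distributions is found by a coarse grid search. The function names and the error rate of 0.1 are hypothetical.

```python
import math


def mutual_information(px, channel):
    """Sum of p(x,y) log2[p(x,y) / (p(x) p(y))] for rows channel[x][y]."""
    joint = [[px[i] * channel[i][j] for j in range(len(channel[0]))]
             for i in range(len(px))]
    py = [sum(col) for col in zip(*joint)]
    return sum(pxy * math.log2(pxy / (px[i] * py[j]))
               for i, row in enumerate(joint)
               for j, pxy in enumerate(row) if pxy > 0)


def capacity_binary_input(channel, grid=1000):
    """Maximum over input distributions (w, 1 - w) on a grid of w values."""
    return max(mutual_information([w / grid, 1 - w / grid], channel)
               for w in range(1, grid))


bsc = [[0.9, 0.1], [0.1, 0.9]]   # each digit altered with probability 0.1
print(capacity_binary_input(bsc))
```

For this symmetric channel the maximum occurs at equally likely inputs, and the result agrees with the familiar closed form 1 + 0.9 log2(0.9) + 0.1 log2(0.1) bits per digit.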

It should be noted that time rates are not the only ones of interest, though they are probably 
most important. Storage capacity of a given medium for information is another. For a magnetic tape, 
for example, T would represent length of tape, and C could be measured in bits/cm. Similarly, T would 
be area for information stored on photographic film, etc. Different capacities are often related when 
systems are formed, as in the case of transmission of information stored on a magnetic tape. If the 
information is read off the tape moving at speed v, and it has storage capacity C (per unit length), 
then the minimum channel capacity required to avoid information loss is vC. Factors analogous to v 
come up (derivatives, Jacobians) when different media are coupled.

Before returning to the general theory, it seems advisable to examine briefly a system other than 
a communication system, e.g., a production unit of a manufacturing enterprise. The entire enterprise
is an organization of which the production unit is an element. In its interaction with the rest of the 
organization, it has input and output of material and information. It is thus itself an organization 
with a function, or a system. Its input consists of the raw or partly processed material on which it 
operates and the input and output specifications. The output consists of the partly finished item to 
be passed on further in the organization for additional treatment or the end item itself. Its capacity 
(here, productive capacity) is the maximum rate at which it can produce acceptable output items. The 
input and output ensembles are sets of mathematical entities, e.g., numbers, serving to describe the 
incoming and outgoing objects with desired exactness. They often consist of the results of measure- 
ments made on the object, go - no go indications, and the like. Each object, incoming and outgoing, 
corresponds to a "point" in a space (usually discrete). The statistical population consisting of the 


65 


totality of these objects defines input and output distributions. The specifications require that the 
object points fall in certain regions. If the entire output falls within the output specification, 
the production unit is reliable; the "message" (i.e., specification) has been transmitted to the
"destination" (output object) through the "channel" (production unit). When the "message" is garbled 
by "noise," i.e., the specification limits are exceeded, the object is rejected on inspection. Just 
as in a communication system where one strives not to make sure that all messages are received cor- 
rectly but rather to keep errors below some acceptable level, so in manufacturing processes, one aims 
not at making all items identical but rather to keep them within acceptable tolerance limits. It can 
be seen that statistical quality control in production and statistical communication theory are indeed 
closely related. Furthermore, the general theorems of communication theory apply to all systems; 
little difficulty arises in general in specifying input and output ensembles. 

Returning now to the general theory, the rate of organizing the input-output ensemble is clearly 
decreased by any weakening of the coupling between them. Errors, unreliability, noise, breaking of a
physical connection included in the coupling means, etc., tend to increase Hy(x) in the expression for 
rate R = H(x) — Hy(x). (10) 
If the error statistics are independent of those of the input and output ensembles (corresponding in 
communication to having noise statistics independent of what messages are chosen), an error entropy, 
analogous to noise entropy in communication theory, can be defined. In many important situations, 
this entropy, denoted by H(N), includes all of the deviation from maximum organization, in which case, 
one can write R = H(x) — H(N). (11) 
This can be extended to yield a generalization of Shannon's theorems on the maximum rate of transmis- 
sion of information in terms of available signal and noise powers, if a suitable superposition prin- 
ciple for signal and noise is satisfied. For the magnetic tape earlier discussed, this is the case 
with noise power replaced by minimum significant variation in intensity of magnetization and maximum 
signal power by saturation tape magnetization. The techniques of wave-form analysis also apply, with 
time domain replaced by space domain, frequency by wave-number (reciprocal of wave-length). All the 
theorems carry over and in particular, with suitable coding, a storage capacity arbitrarily close to 
the theoretical maximum can be obtained. 


Concluding Remarks 


The usefulness of an analogy resides in making an unfamiliar field come under previously known 
concepts, in making previous knowledge applicable to new situations. Information theory, or cyber- 
netics, has been doing precisely this, and the present paper attempts to carry the process further. 
The organization concept appears to be a generalization of the information concept in that it (a) in 
no way depends on the asymmetry between transmitter and receiver implicitly assumed in communication 
theory; (b) handles any number of interacting ensembles, not just the conventional pair of communica- 
tion theory; (c) includes information as a special case. Examples based on circuit considerations 
where the new concepts are applicable will now be given. 

In the field of circuit design, one might expect information theory to have considerable impact, 
but this has not been the case. The reasons appear to be (a) that the components of circuit theory
are either taken to be noiseless, so that each function is merely a one-to-one mapping, trivial from 
the information-theoretic viewpoint; (b) many special methods of reducing noise (shock mounts, shield- 
ing, ruggedized tubes, etc.) are effective without the use of complicated theory; (c) statistical fil- 
tering and the like is justified in relatively few cases. 

One would expect information and organization theory to be of real importance in cases where the 
component is truly an element of an organization, with output ensemble specified only statistically 
by its input. The two chief cases of this appear to be (a) when the individual circuit includes ele- 
ments whose behavior is specified statistically, and (b) when the individual circuit is determinate 
but one of a statistical population of circuits made to the same specifications, whereby the input to
a given element reflects the statistics of elements feeding it, and its output reflects the additional 
effects of its own probability distribution over a set of performance-determining parameters. It is 
clear that the two cases are closely related mathematically, but they may be appropriate to entirely 
different physical situations. 

Case (a) would include, for example, complex electronic systems like computers, where reliability 
(i.e., elimination of "generalized" noise) becomes important even for individual elements. The con- 
trolled use of "redundancy" (i.e., channel capacity in excess of that required to transmit information 
at a specified rate in the absence of noise) to reduce the effects of noise is clearly applicable to 
increasing the reliability of the computer. Self-checking codes are precisely this. In general, a 
similar employment of additional system capacity can increase system reliability, another example being 
a decrease in reject rate sometimes achieved by establishing inspection points at intermediate stages 
of a production operation. Case (b) comes up when one is faced with the problem of designing mass- 
produced equipment to some performance specification with preassigned tolerances specified for the com- 
ponents. When the number of components is large with ordinary design methods, even moderate perform- 
ance specifications can make unreasonable demands on components. Here information or organization 
theory may well be crucial in design, for the components are "noise" sources which in their totality 
blot out the "message" (falling within performance limits). "Redundant" design appears mandatory, and 
can be viewed as a sophisticated version of the familiar "factor of safety." 




AN INFORMATION-THEORETICAL MODEL OF ORGANIZATIONS 


Manfred Kochen
Institute for Advanced Study, Princeton, N. J., and Paul Rosenberg Associates, Mt. Vernon, N. Y.


Introduction 


An organization can be treated as a set of component members which, as a group, is capable of per- 
forming functions which the individuals are not. This property is due to the ability of each member to 
influence, and be influenced by every other one in the system. 


It is meaningful to consider only those organizations which can be defined operationally, relative 
to a given observer. The observer obtains information about the system he is describing by interacting 
with the system much in the same manner in which the components of the system interact with one another. 
Just as the above-mentioned observer-system pair may be viewed as an organized system in itself within 
a larger environment, each member of the original system, and any collection of these, can be viewed
as subsystems also. The observer describes the system upon which attention is focussed according to a 
task or function, and the efficiency with which the system performs it. The function of the entire 
group as well as that of each member is assumed to be physically observable and measurable by the 
observer, although not necessarily by the members. 


The basic notion which is exploited here is that every potential member of an organization acts 
so as to maximize the value resulting from his action to him. Organizations are assumed to exist, 
because the value of belonging to the group is greater to each member than of not belonging. This 
holds clearly also in cases where members are constrained or coerced to be in the group, in which 
case the value to these members of not belonging is more negative than that of belonging. The potenti- 
ally most highly organized systems are those in which there exist states of the system for which the 
values to every member are most nearly all maximized. 


Although a system may be highly organized potentially, the potential may be far from realized, 
in which case the efficiency is said to be low. It will be seen that this efficiency may be measured 
in terms of the uncertainty of each member about the actions of the others; that this uncertainty 
is best removed by efficient coding procedures, which admit of the maximum of intercommunication
within the group, considering limited storage capacities. 


It is the purpose of this paper to make the above concepts more precise; to establish a formal 
mathematical model based upon five axioms which define the members, and characterize their permitted 
behaviour; to apply the well known results of information theory in order to determine which modes 
of communication and action should be, can be, and are most likely to be evolved for each member 
to maximize his value function; these being assumed as known to the observer to whom these questions 
are directed. 


The Formal Model 
Axiom 1: 
A set $S_i$ is assigned to each member $P_i$ of the group $P = (P_1, P_2, \dots, P_N)$. In the duration of
time $\Delta t$, $P_i$ may choose one of the elements of $S_i$, denoted as $s_i$ by the observer who is outside
of the system.

The specification of the set $S_i$ is the task of this observer also, and the set may be non-denumerable,
as, say, the set of all real numbers in the closed interval [0,1]; countable, such as the set of all
integers; or finite, depending upon the observer. In most of the discussion to follow, all the $S_i$ will
be assumed to consist of m elements, although not necessarily the same ones for each $i$.


The notion of choice is left undefined, but it is assumed that the result of each choice has an
effect on $P_j$, $j = 1,\dots,N$, and on the observer, which is capable of being recorded in coded form:


Axiom 2: 

A set of "code symbols" composed of binary elements (zeroes and ones), $S_{ji}$, is assigned
to each $P_i$ with reference to $P_j$. The effect of a choice by $P_j$ is the selection of some element from
$S_{ji}$, denoted as $s_{ji}$, to $P_i$, the selection being made by $P_i$. That is, $P_i$ encodes as $s_{ji}$ what
the outside observer calls $s_j$.

Crudely said, when $P_j$ chooses and performs action $s_j$, $P_i$ "interprets" this action as $s_{ji}$. The
axiom essentially states that each member is able to receive the effects of actions made by other
members, and make statements about these actions. The actions may be expressed statements themselves.

It is not necessary that $S_{ji}$ and $S_j$ be in one-to-one correspondence, and it is, in fact, this
which contributes to $P_i$'s uncertainty about $P_j$.




Axiom 3: 

A set function, $v_i$, which maps each state of the system as encoded by $P_i$, $G_i = (s_{1i}, \dots, s_{Ni})$,
into an element of an image set, $T_i$, is assigned to each $P_i$. The set $T_i$ is assumed to be partially
ordered.

The function $v_i$ is to be interpreted as defining the value of some states of the system to $P_i$
relative to some or all other states. If the ordering is total, then for any two states, one is either
preferred, considered indifferent to, or inferior to the other. In some cases, the assumptions on $T_i$
may be strengthened; open sets and neighborhoods may be defined in it, so that it is possible to say
that one state of the system has a value to $P_i$ which is "close" to that of another state. Further,
$T_i$ may be considered a metric space, with some of the properties of number sets. The choice of the
specifications of the $T_i$ as well as the $v_i$ depends upon the observer's knowledge of the system, and
is known, a priori, to the observer only.

A state $G_i$ is said to be a maximal state for $P_i$ if it is preferred to every other state known to
$P_i$. There need not be a unique maximal state for every $v_i$. In case $S_i$, $i = 1,\dots,N$, and $T_i$ are
topological spaces, $v_i$ is defined to be continuous; then, even though a maximal state may not always
exist, a state for which $v_i$ takes on its least upper bound (supremum) always exists.

If it is supposed that $S_j$ and $S_{ji}$ are in 1-1 correspondence for all i, j, and there exists a state
which is a maximal state for all the $P_i$, the organization is potentially perfect.


Axiom 4: 

$P_i$ chooses that element from $S_i$ which, according to his data about the ordering on $T_i$, maximizes
his expected value.

If it is temporarily assumed that $v_i$ and the frequencies of the various possible choices and
states of the system are known to $P_i$, Axiom 4 is to be interpreted in the sense of game theory.
There are, of course, several other ways of formulating Axiom 4, such as the minimax principle used
in game theory, the Bayes criterion, Hurwicz criteria, etc. Here the observer again specifies the
criterion according to which the members make their choices.
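
Read as a maximization of expected value over observed frequencies, Axiom 4 can be sketched as follows. This is only an illustration under stated assumptions: the function name `best_choice`, the alternative labels and the payoff numbers are hypothetical, and the history of the other member's choices is assumed non-empty.

```python
from collections import Counter

def best_choice(own_alternatives, observed_states, value):
    """Return the alternative with the largest empirical expected value.

    own_alternatives: the member's own set of alternatives (hypothetical labels).
    observed_states: non-empty list of past coded choices of the other member.
    value: function (own_choice, other_choice) -> number, standing in for v_i.
    """
    counts = Counter(observed_states)
    total = sum(counts.values())

    def expected(own):
        # Weight each possible reply by its observed relative frequency.
        return sum(n / total * value(own, other) for other, n in counts.items())

    return max(own_alternatives, key=expected)

# Hypothetical payoffs for a two-member system.
payoff = {("work", "reward"): 3, ("work", "none"): -1,
          ("idle", "reward"): 1, ("idle", "none"): 0}
history = ["reward", "reward", "none", "reward"]
choice = best_choice(["work", "idle"], history, lambda a, b: payoff[(a, b)])
```

Other criteria named in the text, such as minimax, would replace the `expected` helper with a different scoring rule.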

Let the $T_i$ be number sets. The quantity

[Displayed formula for $\varepsilon(G)$ not legible in this copy; the surviving fragments indicate a sum over
$i = 1, \dots, N$ of terms $\varepsilon_i$ formed from $v_i$ and a maximum.]

may be taken as a measure of the order of the organization. It is easily seen that if the system is
potentially perfect, all the $\varepsilon_i$, and hence $\varepsilon(G)$, are 1, where G is the maximal state. A value of 0
for $\varepsilon$ is interpreted to mean that the system is completely disorganized. It is quite possible that
for certain kinds of organizations, a lower limit on $\varepsilon$ can be determined, which is to serve as a
disintegration threshold.

That there exist value functions in systems for which a state satisfying all the members does not
exist can be demonstrated by the voting paradox. Let N=3 and consider 3 distinct states of the system,
$G_1$, $G_2$, $G_3$. In the following table, + denotes a high value, and - a low one. If a group value function
is defined as + when two or three of the three members assign the value + to a preference, it is easily
seen that the group value is not consistent in the sense of transitivity. > denotes "is preferred to".


          G_1 > G_2    G_2 > G_3    G_1 > G_3

P_1           +            +            +
P_2           -            +            -
P_3           +            -            -
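
The intransitivity of a majority group value can be checked mechanically. The sketch below assumes the classical three-member cyclic rankings (an assumption made for illustration; the names `rankings` and `majority_prefers` are hypothetical) and applies the rule that the group prefers one state to another when at least two members do.

```python
# Each member's ranking of the three states, best first (assumed for illustration).
rankings = {
    "P1": ["G1", "G2", "G3"],
    "P2": ["G2", "G3", "G1"],
    "P3": ["G3", "G1", "G2"],
}

def prefers(ranking, a, b):
    """True if state a comes before state b in the given ranking."""
    return ranking.index(a) < ranking.index(b)

def majority_prefers(a, b):
    """Group value is + when two or three members prefer a to b."""
    votes = sum(prefers(r, a, b) for r in rankings.values())
    return votes >= 2

states = ["G1", "G2", "G3"]
group = {(a, b): majority_prefers(a, b) for a in states for b in states if a != b}

# Transitivity would require G1 > G3 whenever G1 > G2 and G2 > G3.
is_transitive = not (group[("G1", "G2")] and group[("G2", "G3")]) or group[("G1", "G3")]
```

With these rankings the group prefers G1 to G2 and G2 to G3, yet prefers G3 to G1, so no state satisfies the group ordering.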


Axiom 5: 
$P_i$ may store up to $c_i$ binary digits, and may change at most $r_i$ bits per time unit $\Delta t$. These
quantities are important in determining the amount of information $P_i$ may store if he encodes it properly,
and the number of other members with which he may be in contact per time interval.
It is not specified what $P_i$ may store, but under certain conditions there is always an optimal
set of quantities and an encoding procedure, in the sense that $P_i$'s value will be as large as possible.
There are also optimal communication networks which will lead to the realization of the [word illegible] of the group.


Data 


The time interval $\Delta t$, which has been referred to above, and which was defined as the duration
in which only one choice from $S_i$ may be made, shall henceforth serve as a time unit. Let t be an integer
which denotes the number of time units which have elapsed since the observer began to observe the
system. It is now assumed that the following data is available to $P_i$, $i = 1,\dots,N$, at time t:

1. The t states through which the system, as encoded and stored by $P_i$, has passed.

2. The t relative values associated with each state.

Clearly, only so much of this information is actually available to $P_i$ as can be suitably stored by
him. The data, as described above, can be summarized in the form of a rectangular matrix, at most N×t,
$D_{it} = (d_{i1}, d_{i2}, \dots, d_{it})$, where $d_{i\tau}$ is the transpose of the row vector
$(s_{1i,\tau}, \dots, s_{Ni,\tau};\ v_{i\tau})$. Because of $P_i$'s limited storage capacity, all the entries in this
matrix need not be filled, and the actions and decisions $P_i$ is able to perform must be based on this data only.




All the N data matrices, the knowledge of the $v_i$, etc. are, of course, available to the observer
who describes the system also.


Variation with time 

At t=0, which might be called "a priori", $P_i$ has no basis whatever for any action; nor does any
other member. During the time interval $(0, \Delta t)$, $P_i$ may receive and store the choices of as many as 21
members, including himself, such that no two of them correspond to the same sequence of bits. During
the next time interval, $P_i$ receives a message about a choice from, say, $P_j$. $P_i$ first compares the message
with the one he has received from $P_k$ (if any) during the previous time interval. If the two messages
are the same, it is unnecessary to waste storage by recording the second message; it is sufficient to
indicate the fact that there are now two members in the category which was established during the first
period. If the second message differs from the first, a new category in the set $S_{ji}$ is established.
In this way, the sets $S_{ji}$ for all j and i are built up in time; they may even be called "ensembles",
because frequencies are associated with the elements.
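
The build-up of categories with attached frequencies can be sketched as a counting procedure. This is a minimal illustration, assuming messages arrive as bit strings, one per time unit; the function name `build_ensemble` and the sample messages are hypothetical.

```python
from collections import Counter

def build_ensemble(messages):
    """Classify received messages into categories and count recurrences.

    messages: iterable of bit strings, one per time unit, standing in for
    the coded effects received by one member (assumed input format).
    """
    ensemble = Counter()
    for msg in messages:
        # A message equal to a stored one only raises that category's count;
        # a message not seen before opens a new category.
        ensemble[msg] += 1
    return ensemble

received = ["01", "10", "01", "01", "11"]
ensemble = build_ensemble(received)
frequencies = {code: n / len(received) for code, n in ensemble.items()}
```

Only one copy of each distinct message is kept, together with a counter, which is the storage saving the paragraph describes.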


Consider the relationship of $P_i$ influencing $P_j$. This may be defined operationally relative to $P_k$
as the condition in which $P_i$'s choice is followed by a certain choice of $P_j$ with a frequency greater
than could be attributed to chance alone. It would be reasonable to require that if $P_i$ influences $P_j$
strongly relative to $P_k$, $P_i$ should also influence $P_j$ strongly relative to any other $P_l$ who has
information about the pair. The consequences of this requirement have yet to be explored.


Examples 
In the following examples, N=2. The extensions to N> 2 are not difficult to imagine. 


Example 1. Defense team. 
The purpose of the pair, as determined by the observer, is to destroy an opponent. Let $S_1$ consist
of the following two elements:
d : to act as decoy; to sacrifice, drawing the opponent's entire attention.
k : to strike at opponent, with assurance of destroying him.


$S_2$ consists of the analogous two alternatives, d' and k', plus a third:


f' : to feign or bluff, neither striking nor sacrificing.
Define the codes as follows: 
[Table of the codes and of the value functions $v_1$ and $v_2$ not legible in this copy.]


In this example, the letters t, b, g, e may be thought of as representing the values: terrible, bad,
good, excellent, as interpreted by $P_1$; these letters with primes denote the analogous values to $P_2$.
For instance, $v_2(d', d') = t'$ states that $P_2$ associates the value "terrible" to what he interprets as a
simultaneous choice by himself and $P_1$ to sacrifice. This example illustrates the importance of
communication to organizational efficiency; also one of the ways in which communication may fail, since $P_1$
confuses the choices of f' and d' by $P_2$ in this example.


Example 2. Duet of singers. 


Suppose that the objective of this duet is a harmonious rendition of a musical composition for two
voices. $S_1(t)$ is the collection of all the possible sounds which $P_1$ could choose to make during the
period $(t, t + \Delta t)$, and $S_2(t)$ is similarly defined for $P_2$. $S_{21}$ is the set of all the possible sounds
which $P_1$ has heard $P_2$ make, so that any sound which $P_2$ makes will be classified as some element of $S_{21}$.
$S_{12}$ is the analogous set for $P_2$. The value functions, $v_1(s_{11}, s_{21})$ and $v_2(s_{22}, s_{12})$, depend upon the
critical discrimination and musical abilities of $P_1$ and $P_2$ respectively. An experiment can easily be
designed in which $P_1$ and $P_2$ can rate the value of any particular tone combination (state of the system).

It is intuitively clear in this example how each singer maximizes his value function, and what factors
determine the extent to which the pair can achieve the objective.


Example 3. Production team. 


Let $P_1$ be a worker, and $P_2$ his boss. It is quite unnecessary to provide an objective for the pair
other than that each tries to maximize his value function (the value to himself). This was the case in
the previous two examples also, the objectives having been stated for clarity. Let $S_1$ be the set of all
the possible levels of intensity at which $P_1$ is capable of working. $S_{11}$ is a scale by which $P_1$ measures
or "judges" this quantity. $S_{12}$ is a scale, subjective for $P_2$, by which $P_2$ can measure the same quantity.
That is, an element of $S_{12}$ represents the boss's interpretation of how hard the worker works. Let $S_2$
be the set of all possible rewards to the worker of which the boss is theoretically capable. $S_{22}$ is the
scale by which $P_2$ measures this, and $S_{21}$ is the scale by which $P_1$ measures it. $v_1(s_{11}, s_{21})$ is the value
to $P_1$, in $P_1$'s subjective value scale, if $P_1$ chooses to work with intensity $s_{11}$, and if $P_2$ chooses to
reward $P_1$ to extent $s_{21}$, both measured according to $P_1$'s scale for these. $v_2$ is similarly defined.


This example illustrates how the selection of an alternative may differ from a statement or an
interpretation about this selection. Communication proceeds essentially by making statements about state-
ments about statements etc.; more precisely, this amounts to a time-varying coding procedure. The signi- 
ficance of the order in which the members make their choices depends upon the extent to which communi- 
cation throughout the whole group prevails. The effective duration and speed of the evolution of or- 
ganizations will also depend in part upon the extent of communication, which is discussed in the next 
section. 


Example 4. A Learning Experiment. 


In the Bush-Mosteller model of learning, applied to a rat in a T-maze, $P_1$ is the experimenter,
$P_2$ the rat. $S_1$ consists of the two alternatives: reward, non-reward; these are encoded by $P_1$ as, say,
$E_1$ and $E_2$, forming the set $S_{11}$. $S_2$ is the pair of the rat's alternatives: to turn right, or to
turn left; $P_1$ encodes these as $A_1$ and $A_2$, forming the set $S_{21}$. It is not known how the rat encoded
his own choices or those of $P_1$, but if the outside observer understood the rat sufficiently, it is
assumed that the code could be determined. The value function for the rat may be surmised quite easily;
for the experimenter, the value function depends upon his expectations and his prior knowledge of
the behaviour of rats in such situations. It is noteworthy that in such an experiment as this one,
both the rat and the experimenter "learn": acquire information by removing uncertainty about the other's
behaviour.


Example 5. A Mechanistic Physical System. 

The two magnetically coupled circuits shown in the figure are offered as a final example of an extreme
kind of organization. $P_1$ is the outer circuit, $P_2$ the inner. $S_1$ is the pair of possible positions
of the switch in $P_1$, and $S_2$ the corresponding pair for $P_2$. $S_{12}$ is the pair of physical states which
accompany each position of the switch of $P_1$; these two states of $P_2$ might be: the current in $P_2$ being
below a fixed level, and above. $S_{21}$ is a pair of physical states of $P_1$ which accompany each position
of the switch in $P_2$; these might be the force acting on the spring holding $P_1$'s switch in place,
with a value greater and less than a fixed number.

[Figure: two magnetically coupled circuits, each with a spring-loaded switch and an electromagnet.]


In the figure, the springs exert forces on the switches so as to keep them in the positions shown 
when no current flows in either circuit. When the lower switch is closed, there is current in P, and 
a field is created by the upper electromagnet, closing the upper switch. This, in turn, activates the 
lower electromagnet, opening the lower switch, and causing the upper switch to open again after a slight 
delay. It is essentially a double feedback system, and will oscillate. 


The value functions are identical for $P_1$ and $P_2$, and may be interpreted as follows: there are only
four possible states for $P_1$ and $P_2$ (a state for the entire system in this case means a pair of states,
one from each member); the value to $P_1$ or $P_2$ of a state which is consistent with Maxwell's equations and
Hooke's law (if it applies) is "acceptable", i.e. high; the value of a state which is inconsistent with
these physical laws is low, "unacceptable".


Communications Within the System 
Introduction of Information Theory. 


The relationship between any two members of the organizations as thus far discussed may be re- 
garded as that of receiver to source in a communications system. Clearly, the roles of receiver and 
source may be interchanged, and a different channel is, in general, obtained. The channel capacity is 
given by axiom 5. If the value functions are specified by the observer, the chief and fundamental 
problem in the description of organizations which remains is to describe the extent of communication
within the group. In order to apply Shannon's fundamental results on information and coding, it is
necessary to define a probability distribution on the set of alternatives. 




The measure theoretic approach, upon which Shannon's definition of uncertainty or entropy is based, 
is very useful in establishing limit theorems, but the Borel fields and the measures defined on these 
must be known a priori or must be determined by physical methods before the results can be applied to 
the systems treated here. In cases where either no assumptions at all are warranted, or where those 
assumptions that may be made about the a priori measures and sample spaces are of such a nature as not 
to be amenable to analytic treatment, an approach along the lines of distribution-free methods seems 
relevant. That is, the uncertainty which each member experiences about the possible moves another member 
might make, must be defined entirely in terms of the data which is stored by that individual. 


Consider any two members of P, $P_i$ and $P_j$; let $s_{ji,t}$ denote that element from the set $S_{ji}$ which was
recorded by $P_i$ during the t-th time unit, or observation period. Since $P_j$ is assumed able to make at
most one choice during this period, $s_{ji,t}$ is a particular value of the variable $s_{ji}$. It has already been
mentioned that $P_i$ is assumed to be able to decide whether $s_{ji,t}$ and $s_{ji,t+1}$ are equal or not. Equality,
which in this case connotes the ability of $P_i$ to recognize the recurrence of an $s_{ji}$, can be taken as an
undefined term, as part of axiom 2. It is further presupposed that each $P_i$ is able to count the number
of such recurrences and to operate with these numbers according to Peano's axioms. It is only in this
manner that frequencies can be meaningfully defined for $P_i$, and used to utilize his storage capacity
and transmission rate best. It seems as though most of the essential features of computers had to be
postulated in defining each $P_i$.


To simulate Shannon's definition of unconditional entropy, it is expedient to define the frequency
with which $s_{ji}$ has occurred during the first t trials, denoted by $f_{ji,t}(s_{ji})$, as the ratio of the number
of times that $s_{ji}$ has recurred up to time t, to t. It is noteworthy that this is the first point in
this model where numbers must be used, the f's being defined on the field of rationals, the properties
of which may be used in analysis. The formula

$$U_{ji,t} = -\sum_{s_{ji} \in S_{ji}} f_{ji,t}(s_{ji}) \log_2 f_{ji,t}(s_{ji}) \qquad (1)$$

expresses the uncertainty experienced by $P_i$, on the basis of information in the data matrix $D_{it}$,
about which element of set $S_{ji}$ $P_j$ will choose. Any change in the frequency distribution towards greater
concentration decreases $U_{ji,t}$, and the difference in the uncertainties may be taken as a measure of the
amount of information gained. $P_i$ may control the change in these frequencies by repartitioning the set
$S_{ji}$.
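
Formula (1) is the entropy of the observed relative frequencies, so it can be computed directly from a record of coded choices. A minimal sketch follows, assuming the record is a non-empty list of hashable symbols; the function name `uncertainty` and the sample records are hypothetical.

```python
from collections import Counter
from math import log2

def uncertainty(record):
    """Empirical uncertainty in bits of a non-empty record of coded choices.

    Each relative frequency is the number of occurrences of a symbol
    divided by the number of trials t, as in formula (1).
    """
    counts = Counter(record)
    t = len(record)
    return -sum((n / t) * log2(n / t) for n in counts.values())

spread = uncertainty(["a", "b", "c", "d"])        # four equally frequent symbols
concentrated = uncertainty(["a", "a", "a", "b"])  # frequencies 3/4 and 1/4
information_gained = spread - concentrated
```

A move towards a more concentrated distribution lowers the uncertainty, and the difference is the amount of information gained in the sense of the text.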


An expression for the conditional uncertainty experienced by $P_i$, on the basis of information in
$D_{it}$, about which element of $S_{ji}$ $P_j$ will choose in response to the succession of states
which occurred at times t-1, t-2, ..., t-T, is given by

$$U_{ji,t}(G_{t-1}, G_{t-2}, \dots, G_{t-T}) = -\sum_{s_{ji} \in S_{ji}} f_{ji,t}(s_{ji} \mid G_{t-1}, \dots, G_{t-T}) \log_2 f_{ji,t}(s_{ji} \mid G_{t-1}, \dots, G_{t-T}) \qquad (2)$$

where $f_{ji,t}(s_{ji} \mid G_{t-1}, \dots, G_{t-T})$ is the conditional frequency of $s_{ji}$ given $G_{t-1}, G_{t-2}, \dots, G_{t-T}$. If the set
$S_{ji}$ remains unchanged in time, the frequencies and the uncertainties may stabilize. Before stabilization
has taken place, however, $U_{ji,t}$ behaves like a random variable, which is also rational-valued. If values
of $s_{ji}$ occur which have not occurred previously, $S_{ji}$ is altered in that the number of elements or
categories in it is increased by one. Thus, although no uncertainty is removed in such a case, it is
reasonable to assign to $P_i$ a gain of information of the amount of the number of bits required to store the
new element.
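
The conditional uncertainty of formula (2) can be estimated by grouping each recorded choice under the T states that preceded it. The sketch below is illustrative only: the function name, the list-based input format and the sample data are assumptions, and one uncertainty is returned per observed context.

```python
from collections import Counter, defaultdict
from math import log2

def conditional_uncertainty(states, choices, T):
    """Map each observed context of T preceding states to an uncertainty in bits.

    states[k]  : coded state of the system at period k (assumed format).
    choices[k] : coded choice of the observed member at period k.
    """
    by_context = defaultdict(Counter)
    for k in range(T, len(choices)):
        context = tuple(states[k - T:k])   # the T states before period k
        by_context[context][choices[k]] += 1

    result = {}
    for context, counts in by_context.items():
        n = sum(counts.values())
        # Entropy of the conditional relative frequencies for this context.
        result[context] = -sum((c / n) * log2(c / n) for c in counts.values())
    return result

states = ["A", "B", "A", "B", "A", "B"]
choices = ["x", "y", "x", "y", "x", "x"]
U = conditional_uncertainty(states, choices, T=1)
```

Here the choice after state B has always been the same, so that context carries no uncertainty, while the context A still does.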


Storage and Channel Capacities. 


Suppose that each of the sets $S_{ji}$ consists of m elements at time t. If only elements of these sets
which have previously occurred appear, the system may be found in any one of $m^N$ states. Hence there are
$m^{NT}$ ways in which the condition appearing in equation (2) can be realized, and accordingly many items
must be stored. The case of unconditional entropy, in which no past state of the entire system is needed,
can be considered as the special case T = 0, by definition. By time t, assumed much greater than T, $P_i$
need have storage of at most I bits, being required to store at least the following quantities:
N log m bits which $P_i$ may receive and classify in $S_{1i}, \dots, S_{Ni}$ during period t.
N log m bits with which the above must be compared in order to classify. These are permanently stored.
2NT log m bits as standards for comparison and for receiving the particular states $G_{t-1}, G_{t-2}, \dots, G_{t-T}$
     which occurred for the last T observation periods. Half of these bits are stored permanently.
N(T+1) log m bits to store the value to $P_i$ from the present and the last T states. If it is assumed
     that the same value corresponds to two recurring combinations of states, this number is
     multiplied by the number of such combinations which have occurred up to t. At most t-T such
     combinations may have occurred, observation having started at t > T.
Nm log (t - T) bits to store the frequencies which appear in formula (2).


I = N(T+1)(2+t-T) log m + Nm log (t-T) bits (3)




The logarithms in formula (3) and other equations where not explicitly stated shall be understood to be
to the base 2. In equation (3) the quantities m and T may both be functions of the time t and the
individual whose storage capacity is defined, i. N, the number of members, might be regarded as a function
of time, t, in general, since some members may drop out of the organization, and some could conceivably
enter also.
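
Formula (3) can be evaluated numerically to see how the bound grows with the observation time. A minimal sketch, with the function name `storage_bound` and the sample parameter values chosen purely for illustration:

```python
from math import log2

def storage_bound(N, T, m, t):
    """Upper bound I of formula (3), in bits, for t > T.

    N: number of members, T: number of past states conditioned on,
    m: number of elements in each code set, t: elapsed time units.
    """
    if t <= T:
        raise ValueError("formula (3) assumes t > T")
    states_and_values = N * (T + 1) * (2 + t - T) * log2(m)
    frequencies = N * m * log2(t - T)
    return states_and_values + frequencies

# The first term collects the itemized quantities:
# N log m + N log m + 2NT log m + N(T+1)(t-T) log m = N(T+1)(2+t-T) log m.
I = storage_bound(N=2, T=1, m=4, t=9)
```

For these values the first term contributes 80 bits and the frequency term 24 bits.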


It will generally be a rare occasion which requires as many as I bits of storage at time t. 
Shannon's fundamental theorem may be applied as follows: Regard the entire system, P, acting for T time
units as the source, the output of which are the possible states at t which follow the sequence of
fixed states $G_{t-1}, \dots, G_{t-T}$. The entropy of this source is given by formula (2), and shall be denoted by U
bits per symbol $s_{ji}$, for short. $P_i$ is then able to so encode the $s_{ji}$ as to store $c^*/U - \varepsilon$ symbols, where
$\varepsilon$ is arbitrarily small, on the average; but more than $c^*/U$ symbols cannot be stored. $c^*$ is a modified
value for the capacity of $P_i$; it is $c_i$ minus the number of bits required to store the $v_i$ and the
frequencies themselves, as well as any orders and standard comparison symbols that are necessary for the
determination of U and the actual procedure of optimum encoding, such as the Shannon-Fano code.


It should be noted that $c_i$ does not change with time, and may be considered as one of the
fundamental parameters which characterize $P_i$. $c_i$, then, determines the maximum number of other members with
whom a given $P_i$ may be in contact in the sense of there existing a set $S_{ji}$; it also determines the
maximum amount of information $P_i$ can theoretically gather about the behaviour of another member, in the
sense that it determines the limit on m, the number of elements into which $S_{ji}$ is partitioned, beyond
which further partitioning is valueless. Considerations quite similar to the above can be applied to the
rate $r_i$, which is another important parameter characterizing $P_i$. $r_i$ may be regarded as the channel
capacity in the above formulation, and Shannon's fundamental theorem applies there also. $c_i$ and $r_i$
should, of course, be related, for a large value of $r_i$ and a small value of $c_i$ is quite a useless
combination.


Formation of Patterns. 


Two kinds of patterns may be distinguished: the sequence of states of the system, as observed by
$P_i$ for the last T time units, may exhibit temporal regularities; for instance, certain states may recur
with a definite period. Since this kind of pattern is primarily an evolutionary phenomenon, it will be 
mentioned here only in relation to the second kind of pattern: the networks of members with their asso- 
ciated orders and efficiencies, as known to the outside observer into which the system P decomposes, may 
have certain topological regularities; for instance, a system of 10 members, each of whom communicates 
with exactly one other member, decomposes into 5 couples which act independently of one another. 
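
The decomposition of a system into independently acting groups is the partition of its communication graph into connected components. The sketch below uses a small union-find; the function name `components` and the numbering of members from 0 are assumptions made for illustration.

```python
def components(n, links):
    """Group members 0..n-1 into sets joined by two-way communication links."""
    parent = list(range(n))

    def find(x):
        # Follow parent pointers to the representative, compressing the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in links:
        parent[find(a)] = find(b)

    groups = {}
    for x in range(n):
        groups.setdefault(find(x), []).append(x)
    return sorted(groups.values())

# Ten members, each communicating with exactly one other member.
couples = components(10, [(0, 1), (2, 3), (4, 5), (6, 7), (8, 9)])
```

The ten members fall into five couples, as in the example of the text; adding a single link between two couples would merge them into one weakly connected cluster.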


To make the notions of efficiency and directed communication more precise, it is convenient to
define:

$$e_{ji,t} = 1 - \frac{U_{ji,t}}{\log_2 m_{ji,t}}, \qquad e_{i,t} = \frac{1}{N-1} \sum_{j \neq i} e_{ji,t}, \qquad e_t = \frac{1}{N} \sum_{i=1}^{N} e_{i,t}.$$

$U_{ji,t}$ is $P_i$'s uncertainty about $P_j$, measured up to time t, according to formula (2). $m_{ji,t}$ is the number
of elements in $S_{ji}$ by time t. $e_{ji,t}$ can be taken as a measure of the efficiency with which $P_j$ can
communicate to $P_i$ by time t. If this quantity is near 1, $P_j$ communicates to $P_i$ to a large extent; $P_i$
will be quite certain as to which element of $S_{ji}$ $P_j$ selects. If $e_{ij,t}$ is also close to 1, then $P_i$
and $P_j$ are in communication with one another to a large extent. The extent of communication in such a
pair may be measured by the average of the two efficiencies, e.g. $\frac{1}{2}(e_{ij,t} + e_{ji,t})$. $e_{i,t}$ represents
the average efficiency with which $P_i$'s environment (the system P minus $P_i$) communicates to $P_i$.
$e_t$ is an indication of the extent of communication or efficiency of the system as a whole.
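
The efficiency measures can be sketched in code under one reading of the definition, namely that the pairwise efficiency is one minus the ratio of the uncertainty to its largest possible value, log2 of the number of categories. That reading, the function names, and the convention of returning 1 when only one category exists are all assumptions made for this illustration.

```python
from math import log2

def pair_efficiency(U, m):
    """1 - U / log2(m); taken as 1 when fewer than two categories exist (assumption)."""
    if m < 2:
        return 1.0
    return 1.0 - U / log2(m)

def member_efficiency(e, i):
    """Average of e[j][i] over all j != i: how well the others communicate to member i."""
    N = len(e)
    return sum(e[j][i] for j in range(N) if j != i) / (N - 1)

def system_efficiency(e):
    """Average of the member efficiencies over the whole system."""
    N = len(e)
    return sum(member_efficiency(e, i) for i in range(N)) / N

# e[j][i] holds the efficiency with which member j communicates to member i.
e = [[None, pair_efficiency(1.0, 4)],
     [pair_efficiency(0.0, 4), None]]
pair_extent = 0.5 * (e[0][1] + e[1][0])
overall = system_efficiency(e)
```

In this two-member case the second member is fully understood by the first, the first only half understood by the second, and both the pair average and the system efficiency come out at 0.75.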


Because of the limited storage capacity of $P_i$, the number of others with whom $P_i$ can communicate
to a specified extent is a determined quantity. As an example, suppose that each $P_i$ can communicate
with as many as 3 others. Then the following possible network patterns may be present:


[Figure: possible network patterns drawn as dots joined by lines; the patterns referred to in the text are labeled a, b, c, d, e and g.]


Each dot indicates a member, and a line connecting any two dots means that two members are in two-way
communication to a non-negligible extent. b and c are both examples of closed simple chains, where
each $P_i$ may communicate with two others; open chains of this sort may also exist, the members at the
ends being of group a. d is illustrative of any polygon in which three lines emanate from each point.
Patterns of type e may again appear in open or closed chains. A large number of patterns like g can
easily be visualized. A system composed of a large number of members may decompose into any number and
variety of such patterns; the patterns of the closed type, like a, c, d, are not in communication with
one another, in the sense that no individual of one communicates appreciably with some individual of
the other. People in a ballroom are an illustration of this. In a pattern like e, on the other hand,
the decomposition may be regarded as one into triangular clusters which are weakly connected; one member
of each cluster communicates with one member of some other one. The preceding considerations could also
be applied to patterns in which the lines connecting two points are directed line segments, indicating
the extent of one-way communication in the direction of the arrow, and proportional to its length. A
diagram quite similar to an ordinary sociogram is obtained. It is quite clear that neither one- nor
two-way communication need be transitive relations, in the sense that if $P_i$ communicates to $P_j$ and $P_j$ to
$P_k$ then $P_i$ would have to communicate to $P_k$ also. If transitivity were postulated, isolated clusters of the
type a, b, c would be obtained.


In addition to the static patterns discussed above, it is possible to describe patterns of choices 
within a particular state of the system. If the system stabilizes, i.e. reaches a steady state, such 
a pattern is quite likely to obtain. In addition, a pattern may change with time, and recur, as in the 
case of oscillating systems; thus, a system may be described in terms of the communication and choice
patterns at any one period, and also in terms of temporal patterns, which are essentially recurrences
of the communication and choice patterns in definite periods. There is yet much to be done in this area
in the direction of relating the possible decompositions of a system into patterns to the storage
capacities, the value functions, and the channel capacities of the $P_i$.


Results


Uniform Convergence of Frequencies. 

Definition: Abbreviate $f_{ji,t}(s_{ji} \mid G_{t-1}, G_{t-2}, \dots, G_{t-T})$ by $f_t(j)$. $f_t(j)$ is uniformly convergent in t, if
for any rational $\varepsilon > 0$, there exists a T', dependent on $\varepsilon$, but independent of j, such that t > T'
implies that $|f_t(j) - f_{t+\tau}(j)| < \varepsilon$, for any positive $\tau$.

This definition is evidently not satisfied by any finite sequence such as is available to the P; 
unless some assumptions about the infinite sequence are inferred from the finite beginning. This is 
tantamount to an assumption about the regularity of P,'s behaviour, but one that is subject to continual 
revision and verification as the observation time progresses. A small number, $\varepsilon = \varepsilon_0$, could be chosen,
and the above definition could be weakened to hold for all but a small fraction $\delta$ of values of t which
lie between T' and the largest t for which data is available.


Theorem: If $f_t(j)$ converges uniformly to $f(j)$, then $U_t = -\sum_{s_{ji} \in S_{ji}} f_t(j) \log f_t(j)$ converges uniformly
to $U = -\sum_{s_{ji} \in S_{ji}} f(j) \log f(j)$.

Proof: It is necessary to show that for any $\varepsilon > 0$ there exists an integer $T(\varepsilon)$ such that t > T
implies that

$$\left| f_t(j) \log f_t(j) - f_{t+\tau}(j) \log f_{t+\tau}(j) \right| = \left| \log x_{t,\tau} \right| < \varepsilon. \qquad (4)$$

[The definition of $x_{t,\tau}$, the convention adopted when $f_t(j) = 0$, and the intermediate inequalities are not
legible in this copy.]

Choose $\varepsilon'$ suitably small relative to $\varepsilon$, where $\varepsilon$ is arbitrary. It is possible, by the hypothesis, to
find an integer $T(\varepsilon')$ such that $t > T(\varepsilon')$, $\tau$ positive, imply that $|f_t(j) - f_{t+\tau}(j)| < \varepsilon'$.
Choosing $T(\varepsilon) \geq T(\varepsilon')$, (4) is verified. Because of the uniform convergence, the summations may now
be performed term by term, and the result follows.


The theorem can be generalized to the case where $S_{ji}$ is a measurable set of finite total measure $\mu$,
$f_t(s)$ a measurable function on $S_{ji}$, and

$$U_t = -\int_{S_{ji}} f_t(s) \log f_t(s) \, d\mu(s).$$

The integral is an abstract Lebesgue integral. The proof is essentially the same, and based on the fact
that a uniformly convergent series may be integrated term by term. Furthermore, the hypothesis can be
weakened by dropping the assumption of uniform convergence, because the Lebesgue convergence theorem
guarantees that the operations of limit and integral may be interchanged, $|f_t(s) \log f_t(s)|$ being bounded
by 1 for all s in $S_{ji}$.


It is of some interest to study the rate of convergence. Let $T_0(\varepsilon)$ be the smallest integer such
that $t > T_0(\varepsilon)$ implies that $|f_t(j) - f_{t+\tau}(j)| < \varepsilon$. Then this integer also represents the smallest
number of terms in the sequence $\{f_t(j)\}$ which will make (4) valid for $t > T_0(\varepsilon)$. This is true
because if $\varepsilon' < \varepsilon$ were chosen, $T_0(\varepsilon')$, the smallest number such that $t > T_0(\varepsilon')$ implies that
$|f_t(j) - f_{t+\tau}(j)| < \varepsilon'$, then would be larger than $T_0(\varepsilon)$, $T_0(\varepsilon)$ being a monotonically non-increasing
function of $\varepsilon$.


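Theorem: If $T_0(\varepsilon)$ is the step-function of $\varepsilon$ defined above, then $t > T_0(\varepsilon)$ implies that

$$\frac{|U_t - U_{t+\tau}|}{\tau} < \varepsilon.$$

Proof: Define $\Delta U_t = U_{t+1} - U_t$. Then t' > t implies that $|\Delta U_{t'}| < |\Delta U_t|$.
To prove the above statement, suppose the contrary: $|\Delta U_{t'}| > |\Delta U_t|$. Let $\varepsilon$ be $|\Delta U_t|$
and find $T_0(\varepsilon)$, the smallest integer such that $t_1, t_2 > T_0$ imply that $|U_{t_1} - U_{t_2}| < \varepsilon$.
Clearly, t [relation illegible] $T_0$, whence $t' > T_0$. Taking $t_1 = t'$ and $t_2 = t' + 1$,
$|\Delta U_{t'}| < \varepsilon$ or $|\Delta U_{t'}| < |\Delta U_t|$, which is a contradiction.

Now, $|U_t - U_{t+\tau}| \leq |\Delta U_t| \, \tau$.
For $t > T_0$, $|\Delta U_t| < \varepsilon$ and the result follows.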


By means of this theorem, the smallest number of terms can be found which stabilize the uncertainty
in the sense that its time rate of change is less than an arbitrarily prescribed $\varepsilon$. $P_i$ can further
reduce his uncertainty about $P_j$ by repartitioning $S_{ji}$ into a finer net.


If, in fact f45 £6854 /G ) is a uniform distribution, and S;; is, for example, 
3 
=! 
Sq =f 10,4) Ly Hedy vale LY 
then it is easily shown that the best repartition, in the sense of removing the most uncertainty is: 


SO g5) beet Be pne Lee 


For, $U_{ij,t} = \log m$; letting the new $m' = f(m)$, $\Delta U_{ij,t} = \log f(m) - \log m$. To obtain $f(m)$ such
that $\Delta U_{ij,t}$ is maximum,

$$\frac{d\,\Delta U_{ij,t}}{dm} = \frac{1}{f(m)}\frac{df(m)}{dm} - \frac{1}{m} = 0,$$

whence $\log f(m) = \log m + \log 2$, since $f(1) = 2$. Therefore, $f(m) = 2m$.


Two-member systems.
Are there any relations which govern the uncertainties which follow from the assumption that each
$P_i$ tries to maximize the value to him? It is instructive to consider the special case of N = 2.
Assume that the members, herein also called players, have played so long that $P_1$ knows that
if he chooses ‘Sj; and Pp» chooses 5S,» he wins yg ; 


(2) 
" Su " Sy -r v ; 
nd o (21) 
" 
W 5; oa) tt V, 3 
@) 
Ws Si " SG " UH e 
) > w Assume also that P> knows that 
if he chooses 5,, and P, chooses 53 » he, Po, wins V," ; 
a) (2) (2). 
" S 3 tt ws " vi ; 
(2) , @) 
n ae " Spa " yen; 
) 
" ne " oe, it Ver, 


(a2) ( 
Suppose further that $P_1$ "guesses correctly" with frequency $q_1$ that $P_2$ chooses $S_{21}$, hence $S_{22}$
with frequency $1 - q_1$. It would be reasonable to assign to $q_1$ some a priori value such as


) 
- Max Live? + VO?) Vr+ V9] 
Maxil Wie Biever) 
To simplify the notation, as well as the calculations, let it be further assumed that $S_{21} = S_{11}$,
$S_{22} = S_{12}$, and $v_2 = -v_1$. The problem now assumes the form of a simple rectangular two-person, zero-sum
game, having a 2 x 2 utility matrix, in which each player does not choose in complete ignorance of the
other's choice. It is not difficult to see how the results can be extended.


Let $p_1$ be the frequency with which $P_1$ chooses $S_{11}$ such that his expectation,

$$E_1 = p_1\left[v^{(11)} q_1 + v^{(12)}(1 - q_1)\right] + (1 - p_1)\left[v^{(21)} q_1 + v^{(22)}(1 - q_1)\right],$$

is maximized, p, being a function of q,, f(q,). If 4 > V2. the solution of the resulting differential 




equation on f, subject to the initial condition f(0) = 1, is p, = Aq | + B, where A and B are functions 
of the v's. If Vere V5 the trivial result p, = 4) obtains. 


If, now, $P_2$, after some time, chooses so as to alter $P_1$'s frequency of "guessing correctly" from
$q_1$ to $q_2$, $q_2$ is a function of $p_1$ such that $P_2$'s expectation is maximum. This statement is now
extended to an inductive statement, relating $p_t$ to $q_t$ and $q_{t+1}$ to $p_t$, expressible as the following pair
of recursion relations:

VALID = hs VAC yo 
eee VC yZ en us Ye ¥ igi t 


V 2) _ V (22) 


The solution to these, in terms of the initial value a,» OF Po is given by: 


t-/ t t+/ 
py "eRtipsia: ‘ono kities Gt Gee eer Ges (5) 
k=0 /-K 
wheres Cc yo Vgsledy Ve yy (22) (VV gee Vi aaee | 
= yer _ ys, as (VE VeYy(V A yor) 


There are certain restrictions on the v's to insure that $p_t$ remains between 0 and 1.


These are $0 \le C \le 1 - K$, or in terms of the v's:
(229) 
O's Pye fiend ai ye OYE! ys?) & (VLA ye psy fee ley Pts yan) 
1) wl (rz) x 
If the utility matrix is symmetric, V’ = yeX, and K = Via yar) = ; C= 6, and py = p, 


for all t. On the other hand, if $|K| < 1$, the dependence of $p_t$ on the initial value disappears. Under
very mild restrictions on the v's, the sequence $\{p_t\}$ converges uniformly, and the previous results
apply. Thus, if $|K| < 1$, a limiting expression for the entropy can be found, which is independent of
the a priori uncertainty of $P_1$ about $P_2$ or $P_2$ about $P_1$.
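
The role of K and C can be illustrated numerically. The sketch below assumes that the pair of recursion relations reduces to a single first-order linear recursion p(t+1) = K p(t) + C; this form is an assumption chosen to agree with the stated restriction 0 <= C <= 1 - K and with the symmetric case K = 1, C = 0, and is not taken from the printed relations. The function names and the numerical values are hypothetical.

```python
def iterate_frequency(p0, K, C, steps):
    """Iterate the ASSUMED recursion p(t+1) = K * p(t) + C.

    Returns the list [p(0), p(1), ..., p(steps)].
    """
    history = [p0]
    for _ in range(steps):
        history.append(K * history[-1] + C)
    return history


def limiting_frequency(K, C):
    """Fixed point of the assumed recursion, meaningful only when |K| < 1."""
    if abs(K) >= 1:
        raise ValueError("the limit is independent of p(0) only when |K| < 1")
    return C / (1.0 - K)


if __name__ == "__main__":
    # Illustrative values only: K = 0.5, C = 0.3 satisfies 0 <= C <= 1 - K.
    for p0 in (0.0, 0.5, 1.0):
        print(p0, iterate_frequency(p0, K=0.5, C=0.3, steps=30)[-1])
```

Under this assumption every starting value in the unit interval is carried to the same limit C/(1 - K), while K = 1 with C = 0 leaves the initial frequency unchanged.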


/-K gear! Ee 
nia Sime ae Gy tage 


A similar relationship holds for the other player, with $q_t$ in place of $p_t$. When the restrictive
assumptions which were made above are dropped, and the general case of N instead of 2 players is considered, a
system of N simultaneous difference equations must be solved. The results will then permit the
computation of the efficiencies, and the extent of communication, $e_i$, in terms of the assumed value functions
for each $P_i$. The capacities $c_{ij}$ and $r_i$ will then determine the possible patterns of communication, as
discussed before, and permit the computation of the efficiencies associated with each.


Conclusions. 


The main limitation of this model is that it represents an over-simplification of most existing
organizations, but the simplifying assumptions can be gradually dropped as the theory is further developed. The
manner in which the outside observer determines the value functions and storage capacities is similar to the way
in which psychologists test subjects to obtain data about subjective judgments, rates of learning,
capacities of retention, etc. The limiting processes which are used in this model are subject to the
same criticism as those in von Mises' frequency approach to probability.


The chief value of this model may prove to be the readiness with which the methods involved are
adapted to treatment by digital computers, particularly with regard to the data matrices and the
difference equations. The memory requirements will, of course, be large, to handle any organizations of interest.
It also becomes easy to formulate many problems, such as the analysis and synthesis of organizational
structures, group coherence, leadership, learning, etc., which could not be simply formulated otherwise,
and experiments to test the model in such applications are readily suggested and designed.


Bibliographical References

(1) Arrow, K., Social Choice and Individual Values, Wiley, N.Y., 1950

(2) von Neumann, J. & Morgenstern, O., Theory of Games and Economic Behaviour, Princeton, 1947

(3) Bush, R. & Mosteller, F., "A Stochastic Model With Applications to Learning", Ann. Math. Stat. 24

(4) Shannon, C. & Weaver, W., The Mathematical Theory of Communication, Univ. Illinois, Urbana

(5) Landau, E., Foundations of Analysis, Chelsea, N.Y.

(6) Fano, R.M., "The Transmission of Information", Tech. Report No. 65, Res. Lab. of Electronics, M.I.T.

(7) Halmos, P., Measure Theory, Van Nostrand Co., N.Y., 1950




SIMULATION OF SELF-ORGANIZING SYSTEMS BY DIGITAL COMPUTER * 


B. G. Farley and W. A. Clark 
Lincoln Laboratory, Massachusetts Institute of Technology 
Cambridge, Massachusetts 


ABSTRACT 


A general discussion of ideas and definitions relating to self-organizing systems and their synthesis is 
given, together with remarks concerning their simulation by digital computer. Synthesis and simulation 
of an actual system is then described. This system, initially randomly organized within wide limits, 
organizes itself to perform a simple prescribed task. 


INTRODUCTION 


Information systems whose response to a given class of inputs changes with time in accordance 
with specified criteria which are chosen to correspond roughly to the "self-organizing" concept have 
been the subject of considerable interest.”’!? Several mechanisms have been constructed or described which are, 
“self-organizing” to some extent,'’®’®’’!and some work has been published on computer-programmed learn- 
ing, such as that by Oettinger.5 Recently, McKay has communicated ideas related to some of those to be 
discussed here.* 


The work to be described was undertaken in an attempt to clarify certain ideas related to such 
systems, and to try to gain some insight into their synthesis by simulation of specific systems using a 
digital computer. Although the work is in an early stage, it is believed that results so far have
exhibited some very interesting properties of a particular system, and have demonstrated the usefulness of
computer simulation methods in studies of this kind where systems are likely to be so complex that ana- 
lytical solutions are difficult or impossible, or do not furnish much information until leads are sug- 
gested by actual experience. 


The work will be presented in three parts. First a general discussion will be given in which 
definitions will be made. Second, the definitions will be applied to an experimental system. Third, the 
details of a self-organizing system and a description of computer techniques used in its simulation will 
be given. 


General Considerations and Definitions 


In order to make our ideas and definitions precise, and at the same time as general as possible, 
it is convenient to introduce a mathematical framework to aid in discussion. 


We will deal first with a general system as shown in Figure 1. Inputs $p_i$ from the left are
transformed into outputs $q_i$ on the right. As indicated in Figure 1, both input and output lines may be
multiple. In what follows, the symbols $p_i$ and $q_i$ will refer to specific, complete configurations on
these multiple lines, finite or infinite in time. No loss of generality will result if all signals are
reduced to a binary equivalent. As an example, then, if there are three input lines, a certain input
might be defined as


$$p_2 = \begin{Bmatrix} 0110001011001 \\ 1001101011001 \\ 0110101000000 \end{Bmatrix} \tag{1}$$

time increasing to the right. Such a configuration will be called a time-channel pattern. 
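
A time-channel pattern of this kind has a direct machine representation. The sketch below is only one possible encoding, with one string of binary digits per input line; the names PATTERN and column are hypothetical and do not come from the simulation program described later.

```python
# One string per input line, time increasing to the right (the digits of eq. 1).
PATTERN = [
    "0110001011001",
    "1001101011001",
    "0110101000000",
]


def column(pattern, t):
    """Return the signals present on all lines at time step t as a tuple of 0/1."""
    return tuple(int(line[t]) for line in pattern)


def duration(pattern):
    """Number of time steps in the pattern (all lines must have equal length)."""
    lengths = {len(line) for line in pattern}
    if len(lengths) != 1:
        raise ValueError("all lines of a time-channel pattern must be equally long")
    return lengths.pop()
```

Reading the pattern column by column gives the configuration presented to the system at each successive instant.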


The transformation T will be allowed to change with time, and we are interested in this change
in so far as it exhibits organizing properties with respect to T. To define properties of this type we
will fix our attention on a particular class C of inputs $p_i$, $i = 1, 2, \ldots, n$, and their corresponding outputs $q_i$.
Each member of this class will usually be finite in length.


We may then break the transformation T down into a class of transforms 


$$T = \{T_1, T_2, \ldots, T_n\} \tag{2}$$


*The research in this document was supported jointly by the Army, Navy, and Air Force under contract with
the Massachusetts Institute of Technology.




where the set of equations

$$q_i = T_i(p_i), \qquad i = 1, 2, \ldots, n \tag{3}$$

serves to define $T_1 \ldots T_n$. If the system contains sources of noise or produces spontaneous outputs,
the transformations $T_1 \ldots T_n$ will be defined statistically as averages over an ensemble of identical
systems started in the same initial state.


Now, in order to discuss one or more properties of such a system dependent on time, it is only
necessary to choose a measure m specifying the properties in question and apply it to T at succeeding
times. In many cases, the measure m(T) will of course depend upon only one, or a few, of the $T_i$'s.


We will consider that the instrument of time change of T is within the system. As an aid to
visualization, Fig. 2 shows the system broken down into two components, one of which contributes
primarily to the transformation T itself, and the other, called the modifier, has the primary function of
producing changes in T. The double line between the modifier and T represents the agency of the
modification, while single lines show information paths. While a sharp dichotomy of function between T and
modifier has been indicated, it is not intended to exclude systems in which the modifier contributes to
the transformation or modifies itself. We will consider everything outside the dashed lines as
"environment," although it should be noted that the exact path of such boundaries is arbitrary.


It may be of some interest to suggest as an illustrative example how the above model as described 
might be used to describe situations which approximate psychological definitions of “learning.” 


Learning behavior by an organism may be defined for the purposes of psychology as a positive 
change in the proficiency of performance of one or more tasks as a result of, prescribed experience.® 


In terms of our model, we may describe this as follows. A number of input patterns are chosen, 
and presented to the organism in prescribed orders and times to provide the required “experience.” One 
or more of these inputs are designated as performance tests or tasks, and a suitable proficiency measure, 
such as a test score, is constructed. This score corresponds to the measure we have attached to the 
general transformation. If the measure increases as a result of presentation of “experience” inputs, and 
does not increase otherwise, the organism is said to learn. 


The provision that the measure should increase only as a result of presentation of “experience” 
rules out as learning systems those in which the modifier operates to increase a measure without informa- 
tion inputs. Control experiments may of course be required to rule out such cases. 


It should be noted that learning thus defined is relative to the input class and measure chosen, 
and the experience prescribed. By varying these parameters, various kinds of learning may be defined. 
For example, transfer learning requires altered experiences or measures of new performance (perhaps with 
special control); learning with relatively short performance inputs is called conditioning, while that 
with long performance inputs is called serial learning. More precise definitions would require close 
examination of the variable parameters. This task is complicated for the psychologist by the fact that 
he is dealing with organisms no two of which are alike. 


Some competing theoretical interpretations of learning may also be referred to the model. For 
example, reinforcement and non-reinforcement theories make different assumptions as to the nature of the 
modifier organization. “Perception” theorists make use of “perceptual systems or fields" which are not 
explicitly represented in our model as presented here.? 


No matter how complex the organization of a system such as we have been discussing, it can always 
be simulated as closely as desired by a digital computer as long as its rules of organization are known. 
This possibility is indicated, for example, by the work of Turing.10 This means that the action of any
system can be studied even though it is too complex for mathematical analysis. Furthermore, the computer
offers unparalleled flexibility in such work, since any part of a simulated system may be quickly and
easily modified to judge the effect of the change. There is of course the disadvantage that present
computer simulation takes place serially in time, so that even with very fast computers considerable time
may be required to simulate highly complex systems. Balanced against this disadvantage, however, is the 
fact that the initial programming for simulation in general requires a great deal less time than actual 
construction of an analogue device even if this is feasible, so that for a very wide class of problems 
the net advantage in both time and cost lies on the side of a computer simulation method, and for an 
additional large class this method is the only feasible one, at least until the system is reasonably well 
understood. 




The work to be described was undertaken partly to examine the problems encountered in such simu- 
lation. Furthermore, it was desired to answer two questions: (1) Can a transformation, initially or- 
ganized at random between rather wide limits, be provided with a modifier which will cause it to become 
organized, as a result of experience, to perform a prescribed task? (2) Can such a system be generalized 
to organize itself to perform any of a rather wide class of tasks? The work to be described is still in 
an early stage, but has resulted in the synthesis of a system which it is believed fulfills the
requirements of the first question.


Application to an Experimental System 


In seeking to synthesize systems along the lines discussed above, it is natural first to choose 
a transformation with promising transforming and modifiability possibilities and then try to discover 
suitable modifiers. Preliminary investigation showed that transformations composed of interconnected 
active non-linear elements with definite thresholds as indicated in Fig. 3 have interesting transforming 
properties. For example, such a net of elements can change a time-channel pattern into a space pattern 
of active elements, and if it is complex enough, can do this uniquely for a given class of patterns. 
Furthermore, enough variable parameters are available in the net to give it useful modifiability
properties. Networks resembling those under discussion exist naturally, and have a great intrinsic interest,
namely networks of nerve fibers or neurons.” It was therefore decided to use non-linear elements
possessing many of the known properties of neuron nets as an experimental transformation.® The details of
the net and associated modifier will be presented later, but first the simple task chosen for performance,
the measure of proficiency used, and the prescribed experience, will be described in terms of the
framework already discussed.


First a randomly connected net is arbitrarily divided into four groups of elements designated as
groups $I_a$, $I_b$, O(+), and O(-). These symbols stand for input groups "a", "b", and output groups "+" and
"-" respectively.

Two input patterns, $p_1$ and $p_2$, are considered. The first, $p_1$, may be represented by the following
scheme,


            ...00100100100...
    I_a     .................
            ...00100100100...
                                                            (4)
            ...00000000000...
    I_b     .................
            ...00000000000...


which indicates that the same periodic input is applied to every element of $I_a$, and that no input is
applied to $I_b$. The input $p_2$ is identical except that the roles of $I_a$ and $I_b$ are reversed. When $p_1$ is
applied, the transformation called $T_1$ is active, and $T_2$ is active when $p_2$ is applied, in accord with
equation (3). In order to define a proficiency measure, we proceed as follows: Let n(+) be the number
of elements active during a given time interval in group O(+), and n(-) the number active during the
same interval in group O(-).


The measure m(T) will be composed of two components, $m_1$ and $m_2$:

$$m(T) = \{m_1, m_2\} \tag{5}$$

where

$$m_1 = m_1(T_1) = \overline{n(+) - n(-)}, \qquad m_2 = m_2(T_2) = \overline{n(-) - n(+)} \tag{6}$$

and the bar denotes a time average over a fixed interval.


Note that $m_1$ is defined above only when $T_1$ is active, and similarly for $m_2$ and $T_2$. Organization
will be said to occur if both $m_1$ and $m_2$ increase.
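
As a hedged illustration of the two-component measure, the sketch below averages the count differences separately over the intervals in which each pattern was active. The record format (pattern number, n(+), n(-)) and the function name are assumptions made for the example only.

```python
def proficiency_measures(records):
    """Time-average the count differences for each active transformation.

    records: iterable of (pattern, n_plus, n_minus), one entry per time
             interval, where pattern is 1 when p1 (hence T1) is active and
             2 when p2 (hence T2) is active.
    Returns (m1, m2); a component is None if its pattern never occurred.
    """
    totals = {1: 0.0, 2: 0.0}
    counts = {1: 0, 2: 0}
    for pattern, n_plus, n_minus in records:
        # m1 averages n(+) - n(-); m2 averages n(-) - n(+).
        difference = n_plus - n_minus if pattern == 1 else n_minus - n_plus
        totals[pattern] += difference
        counts[pattern] += 1
    m1 = totals[1] / counts[1] if counts[1] else None
    m2 = totals[2] / counts[2] if counts[2] else None
    return m1, m2
```

Organization in the sense of the text would then correspond to both returned components increasing from one averaging interval to the next.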


In other words, we may consider an output formed by the accumulated difference of the numbers of
cells active in O(+) and O(-). Presentation of experience will be externally arranged so that $p_1$ is
applied whenever the output is positive (O(+) predominates), and $p_2$ whenever the output is negative (O(-)
predominates). If the output remains near zero for a specified length of time, it is externally "forced"
from zero by adding to the output difference in alternately positive and negative directions. Thus the
whole mechanism is similar in some respects to a servo which must learn to return to zero when displaced,
training experience being given alternately on either side of zero, and increasing organization being
manifested by an increasing rate of return. The patterns $p_1$ and $p_2$ provided by the environment may be
said to enable the mechanism to "sense" the position of its output.


The modifier which causes the measures $m_1$ and $m_2$ to increase was determined largely empirically. It
operates on various parameters of the net in a way to be described later. Information for the operations
of the modifier is generated internally in this simple case in a manner which essentially computes $m_1$ and
$m_2$. However, it should be mentioned that in the general case this may not necessarily be true. That is,
the modifier may use information related to the organization measure, but computed in some entirely
different manner.


Details of Experimental System and Simulation Program 


The general properties of the particular transformation and associated modifier with which the 
initial simulation work has dealt have been presented. This description will now be expanded and related 
to the computer simulation techniques which were developed for the Memory Test Computer of the Lincoln 
Laboratory of M.I.T. A note on the characteristics of the computer may be of general interest: MTC is a
16 bit, parallel machine with a coincident current magnetic-core memory of 4096 words and an operating
speed of about 90,000 single-address add-type instructions per second. Its principal input device is a
Ferranti Photoelectric Reader for punched paper tape and output equipment includes a standard Flexowriter
and several cathode ray tubes for displays which may be photographed. 


The transformation system has been described as a network of non-linear elements in which the 
pathways or connections are randomly established. In this and other parts of the program random processes 
play an important part and should be discussed in more detail. MTC does not have access to a random
element, but there exist many accredited computation routines which generate number sequences in which
the values of the terms are distributed in a nearly statistically homogeneous manner. The pseudo-random
number generator routine which was used develops the nth term, $R_n$, by means of the recursion relation

$$R_n = R_{n-1} + R_{n-k} \quad (\text{sum modulo } p) \tag{7}$$



The series initially is "primed" with k terms chosen from a table of random numbers. 
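
A generator of this additive type can be sketched in a few lines. The code below assumes that relation (7) reads R(n) = R(n-1) + R(n-k) modulo p, which is a reading of a damaged formula; the function name, the seed terms and the modulus in the usage line are illustrative only.

```python
from collections import deque
from itertools import islice


def additive_generator(seed_terms, p):
    """Yield R(n) = (R(n-1) + R(n-k)) mod p, primed with k = len(seed_terms) terms."""
    k = len(seed_terms)
    if k == 0:
        raise ValueError("at least one priming term is required")
    # The window always holds the k most recent terms, oldest first.
    window = deque((s % p for s in seed_terms), maxlen=k)
    while True:
        nxt = (window[-1] + window[0]) % p   # R(n-1) + R(n-k)
        window.append(nxt)                   # the oldest term drops out
        yield nxt


if __name__ == "__main__":
    # Hypothetical priming terms and a 16-bit modulus, for illustration only.
    print(list(islice(additive_generator([12345, 54321, 999], 2 ** 16), 5)))
```

Only the k most recent terms need to be kept in storage, which suits a machine with a small memory.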


To connect network elements at random a matrix $P_{ij}$, expressing the probability that i connects j,
is established for the class of networks under consideration. In the systems to be discussed, the simple
case $P_{ij} = K$, constant for all i, j, was chosen, but more generally the connection probability might depend
on i, j, or any particular characteristics of network elements i and j. For each pair of network elements
a pseudo-random number in the interval ab is then generated and a test is made to determine whether the
number lies also in a subinterval ar of ab, where r is so chosen that the ratio of (r-a) to (b-a) is the
probability $P_{ij}$. Since the pseudo-random numbers are uniformly distributed in the interval, this test
yields positive results with a mean relative frequency equal to the required probability. For each
positive test result, a connection is established and in this way a specific connection matrix, ($c_{ij} = 1$
if i connects j, 0 otherwise), is set up for the given network. K will be called the connectivity of the
net.
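
The subinterval test can be sketched as follows for the constant-probability case. Python's standard random module stands in for the pseudo-random routine of the original program, and the function name, the default interval and the treatment of the diagonal (drawn like any other pair) are assumptions of this example.

```python
import random


def random_connection_matrix(n, K, a=0.0, b=1.0, rng=None):
    """Build an n x n matrix c with c[i][j] = 1 if i connects j, else 0.

    For each ordered pair a number is drawn uniformly from the interval ab
    and a connection is made when it falls in the subinterval ar, where r
    is chosen so that (r - a) / (b - a) equals the connectivity K.
    """
    rng = rng or random.Random()
    r = a + K * (b - a)
    return [
        [1 if rng.uniform(a, b) < r else 0 for _ in range(n)]
        for _ in range(n)
    ]


if __name__ == "__main__":
    net = random_connection_matrix(8, 0.75, rng=random.Random(1))
    for row in net:
        print("".join(str(bit) for bit in row))
```

A position-dependent probability matrix could be handled the same way by computing r separately for each pair.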


With each connection there is associated a sixteen-state weight, $w_{ij}$, which determines the
excitation value on j of activity transmitted from i via this connection (see fig. 3). These weights may in
general be drawn from a distribution in the manner discussed above, although in the example presented
later these weights were chosen equal and set initially at mid-value.


For each element in the network, one row of the connection matrix (representing pathways from the 
element) and a list of associated weights are stored in the computer memory. This requires breaking the
matrix row into 16-bit words and also packing four of the 4-bit weights into one word for storage economy, 
and much of the computing time is consumed in the unscrambling and repacking of these words during the 
simulated operation of the network. 
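
The packing of four 4-bit weights into one 16-bit word might look like the sketch below. The order of the weights within a word (first weight in the low-order bits) and the function names are assumptions; the original MTC layout is not given in the text.

```python
def pack_weights(weights):
    """Pack sixteen-state weights (0..15) four to a 16-bit word.

    The first weight of each group of four occupies the low-order 4 bits
    (an assumed layout). A final partial group is padded with zeros.
    """
    words = []
    for start in range(0, len(weights), 4):
        word = 0
        for offset, w in enumerate(weights[start:start + 4]):
            if not 0 <= w <= 15:
                raise ValueError("each weight must fit in 4 bits")
            word |= w << (4 * offset)
        words.append(word)
    return words


def unpack_weights(words, count):
    """Recover the first `count` weights from a list of packed 16-bit words."""
    weights = [(word >> (4 * offset)) & 0xF
               for word in words for offset in range(4)]
    return weights[:count]
```

The shifting and masking in these two routines is the kind of unscrambling and repacking that the text identifies as a major consumer of computing time.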


Similarly, with each network element there is stored a list of characteristics such as threshold,
time constants, etc. selected from appropriate distribution functions, and addresses and counters
required by the simulation process. These quantities occupy another six 16-bit words. The total storage
requirement, however, is determined largely by the connection matrix and associated weights, since for
these the required capacity increases with the square of the number of network elements. The 4096
registers of the MTC memory limit the size to a network of about 128 elements with connectivity of 0.4.
The time required to generate such a net is approximately ten seconds, or about 900,000 operations. The
complete simulation program occupies about 1500 registers; the remainder of the storage is occupied by
the characteristics of the network.


In order to elaborate on the characteristics of the network elements it is necessary to discuss
more completely the transient behavior of the element during excitation. This transient state of activity
occurs whenever the excitation level exceeds the threshold of the element. After a small time delay, the
element transmits by simultaneously increasing the excitation level of all other elements to which it is
connected as indicated by its associated row in the connection matrix. At the beginning of this delay
interval, the threshold rises to a value which is large enough to prevent a second activation during the
interval. At the end of this interval, suggested by the refractory period in neurons, the element
recovers sensitivity as its threshold decays exponentially to a minimum value characteristic of the element,
measured relative to an adjustable bias level for the network as a whole. The threshold function, $h_j(t)$,
for the jth element may thus be represented as effectively infinite during the refractory interval and


$$h_j(t) = h_{\max} \exp(-a_j t) + b_{j,\min} + b_{\text{bias}} \tag{8}$$


otherwise, where $a_j$ is the threshold decay constant. The comparison of excitation with threshold occurs
in the presence of gaussian noise such that a high level of excitation increases the probability that an
element will "fire" but will not in general completely determine the instant of firing. The behavior of
the network becomes completely determinate as the mean-square amplitude of the noise, which, like the
bias level, is controlled by the modifying sub-system, is reduced to zero. The gaussian distribution is
approximated, as suggested by the central limit theorem, by averaging a set of four pseudo-random terms
for each term of the "gaussian" set.
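
The averaging step can be sketched as below. Python's random module is used here in place of the pseudo-random routine of the original program, and the function name is hypothetical; the mean of four uniform terms on the unit interval has mean 1/2 and variance 1/48, so a shift and scale would be needed to obtain noise of prescribed mean-square amplitude.

```python
import random


def approx_gaussian(rng, terms=4):
    """Average `terms` uniform pseudo-random numbers from [0, 1).

    By the central limit theorem the result is roughly bell-shaped, with
    mean 0.5 and variance 1 / (12 * terms).
    """
    return sum(rng.random() for _ in range(terms)) / terms


if __name__ == "__main__":
    rng = random.Random(2024)
    print([round(approx_gaussian(rng), 3) for _ in range(5)])
```

With only four terms the tails are cut off sharply, which is presumably acceptable when the noise only has to perturb the instant of firing.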


When several elements simultaneously transmit to the same element, the change in excitation of
the affected element is chosen to be the sum of the weights of the active connections, although a more
complicated function of the weights might be used. In addition, the total excitation level at the
affected element decays exponentially with a time-constant characteristic of the element. Thus, activity
pulses arriving within a small time-interval of one another partially combine in excitation value in a
manner related to the observed temporal summation effects in neurons. The change in excitation, $\Delta e_j(t)$,
at the jth element at time t may then be written


$$\Delta e_j(t) = -\mu_j\, e_j(t-1) + \sum_i w_{ij} \tag{9}$$


where the summation extends only over elements which transmitted at t-1 excluding, in the model chosen,
i=j, and $\mu_j$ is the excitation decay constant.
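
One time step of the excitation rule can be sketched as follows. The sketch assumes that decay removes a fixed fraction of the previous excitation in each time step and that incoming weights simply add; the data layout (a nested list with None for absent connections) and all names are hypothetical.

```python
def update_excitation(e_prev, decay, weights, transmitted):
    """Advance every excitation level by one time step.

    e_prev:      excitation levels at time t-1, one per element
    decay:       per-element decay fraction applied each step (assumed form)
    weights:     weights[i][j] is the weight of connection i -> j,
                 or None when i does not connect j
    transmitted: set of element indices that transmitted at time t-1
    """
    e_next = []
    for j, level in enumerate(e_prev):
        # Sum the weights of active connections into j, excluding i == j.
        incoming = sum(
            weights[i][j]
            for i in transmitted
            if i != j and weights[i][j] is not None
        )
        change = -decay[j] * level + incoming
        e_next.append(level + change)
    return e_next
```

Pulses arriving a few steps apart therefore partially add, since the earlier contribution has only partly decayed when the later one arrives.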


The characteristics stored in the computer memory for the jth network element can now be
enumerated in summary:


(1) Type of element, i.e., number of the group $I_a$, O(-), etc. to which the element is assigned.


(2) Time delay, which determines the refractory period, and also, in the simple model chosen, the
delay between firing and transmitting (equal for all pathways from the transmitting element).


(3) Minimum threshold, $b_{j,\min}$

(4) Threshold decay constant, $a_j$

(5) Excitation decay constant, $\mu_j$

(6) Connection matrix row, $c_{jk}$, k = 1, 2, ..., n, where n is the number of elements in the network.

(7) Those connection weights, $w_{jk}$, for which $c_{jk} = 1$.


It should be pointed out that as yet there has been no systematic evaluation of the effects of
varying thresholds, decay constants, and time delays. Their inclusion in the set of characteristics does,
however, illustrate the degree of complexity of the model being simulated.


In the simulation program the time variable is quantized into equal intervals of about one-eighth 
of a refractory period. This time parameter is, in effect, frozen until the program has scanned through 
storage, calculating values of threshold, excitation, etc. for each element, after which it is advanced 
to the next larger value. The real time consumed per "time" interval in carrying out these calculations 
for a net of 128 elements with connectivity of 0.4 is about one second, varying from interval to interval 
according to the amount of activity within the network. 


Activity is introduced into the network by increasing by a large fixed value the excitation level
of those input elements, and at those times, indicated by the presence of "ones" in an input pattern
similar to the $p_1$ of eq. (4). The output, as described earlier, is formed simply by counting the number
of transmitting elements in the output groups O(+) and O(-) during each time interval. The difference of
these numbers, $n_t(+) - n_t(-)$, defines the changes in the output N of the net so that

$$N_{t+1} = N_t + n_t(+) - n_t(-) \tag{10}$$

The computer program is arranged to plot $N_t$ against t directly on one of the display scopes and a time
exposure photograph records the trace. A typical output record appears in fig. 5. In order to automatize
the process of presenting input patterns, the simulated external system is so arranged that $N_t > +N'$
results in pattern $p_1$ and $N_t < -N'$ produces $p_2$, where N' is a small positive number. If $N_t$ remains in the
null interval between -N' and +N' for a specified period (chosen long enough to allow residual activity to
attenuate) the external system displaces $N_t$ to some value +N'' > N' with alternate trial displacements to -N''.
These displacements can be seen as the discontinuities in fig. 5.
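
The behavior of the simulated external system can be sketched as a small state machine. The class below is an illustration under stated assumptions: the names Environment, n_small, n_big and null_limit are hypothetical, the output is accumulated as in eq. (10), and the first forced displacement is taken to be positive.

```python
class Environment:
    """Accumulates the output N and selects the pattern to present."""

    def __init__(self, n_small, n_big, null_limit):
        self.n_small = n_small        # N', half-width of the null interval
        self.n_big = n_big            # N'', size of the forced displacement
        self.null_limit = null_limit  # intervals tolerated inside the null zone
        self.N = 0
        self.null_count = 0
        self.next_sign = +1           # assumed: first displacement is to +N''

    def step(self, n_plus, n_minus):
        """Apply one interval's counts; return 1 for p1, 2 for p2, None for no input."""
        self.N += n_plus - n_minus
        if abs(self.N) <= self.n_small:
            self.null_count += 1
            if self.null_count >= self.null_limit:
                # Force the output away from zero, alternating direction.
                self.N = self.next_sign * self.n_big
                self.next_sign = -self.next_sign
                self.null_count = 0
        else:
            self.null_count = 0
        if self.N > self.n_small:
            return 1
        if self.N < -self.n_small:
            return 2
        return None
```

Each call corresponds to one quantized time interval of the simulation; the returned pattern number tells the program which input groups to excite during the next interval.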


The action of the modifying system is best described by means of the flow diagram of fig. 4. If
a "contributive connection" is defined as any active connection to a "fired" element which may have
contributed to the firing of the element during an immediately previous fixed time, the modifier increases
the weights of contributive connections when the magnitude of the output has just decreased and decreases
these weights if the magnitude of output has just increased, subject to upper and lower bounds of weight
value. Note that weights are changed without regard to their individual influence on the output, and
improvement in performance results from what might be termed "statistical cooperation." In addition, the
modifier manipulates the threshold bias level and the noise level within the net, the former by gradually
lowering bias until activity starts (principally to prevent self-sustained activity, which is difficult
to control) and the latter to allow noise-initiated activity to scan, in effect, new activity modes of
the network when required. Bias control of this sort may be considered use of a "field" parameter, in
contrast to use of local cell parameters.
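
The weight-changing part of the modifier can be sketched as follows. The step of one weight state per modification, the bounds 0 and 15 (taken from the sixteen-state weights) and the dictionary layout are assumptions of the example; the bias and noise controls are not shown.

```python
def modify_weights(weights, contributive, prev_output, new_output,
                   step=1, w_min=0, w_max=15):
    """Adjust the weights of contributive connections in place.

    weights:      dict mapping a connection (i, j) to its current weight
    contributive: connections that may have contributed to a recent firing
    The weights are raised when the magnitude of the output has just
    decreased and lowered when it has just increased.
    """
    if abs(new_output) < abs(prev_output):
        delta = step
    elif abs(new_output) > abs(prev_output):
        delta = -step
    else:
        return weights  # magnitude unchanged: leave the weights alone
    for connection in contributive:
        changed = weights[connection] + delta
        weights[connection] = max(w_min, min(w_max, changed))
    return weights
```

No attempt is made to decide which individual connection helped or hurt; every contributive connection receives the same change, which is the "statistical cooperation" referred to above.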


A Small Network Example 


An example of an eight element network of 0.75 connectivity will now be given. To simplify the 
network for illustrative purposes, the elements are divided into four equal groups and numbered so that 
elements in the same group are represented by successive rows in the connection matrix. In this example, 
oe? ae” for all j, and the refractory delays were all equal to 2 time units. 


Fig. 5 shows the history of the output of this network during the organization process requiring 
approximately 15 minutes of computer time. (The graph is redrawn from a set of photographs which were 
unsuitable for reproduction). Up to the point marked “modifier activated", the behavior of the unaltered 
transformation is seen to be slowly divergent for both positive and negative test displacements. Figs. 
6a through 6d show the weight matrix sampled at the times indicated by points labeled "a" through "d" in 
fig. 5. The numbers appearing in these matrices are in octonary form and will be seen to change sub- 
stantially during the organizing process. 


It will be noted that the changes primarily affecting the output occur in the enclosed boxes;
weights in boxes $I_b$ O(+) and $I_a$ O(-) tend to increase while those in boxes $I_b$ O(-) and $I_a$ O(+) tend to
decrease. It can be seen that the return to zero of the output gradually improves from a condition of
divergence to increasingly rapid convergence as the matrix changes progress.


A total of perhaps 30 randomly organized nets of this type with various connectivities have
actually been tried, the largest of which contained 64 elements with K = 0.75. All but 3 or 4 have been
organized successfully by the modifier, the failures being due to lack of essential connections or other
special properties sometimes resulting from the wide variability of the random process.


It might also be of interest to note that exploratory experiments have been made to examine the 
effect of damage on these nets after organization. Indications are that arbitrary destruction of at 
least 10% of the elements may be sustained without impairment of performance. 


Conclusion 


We have now described a general formulation of the self-organizing concept, and a synthetic
example of a system which organizes itself to perform a simple task.


Although the experimental system was composed of elements having properties similar in many 
respects to the known properties of neurons, it is not claimed at this stage that the results are of 
neurophysiological significance. However, it is believed that the results do show the great usefulness 
of computer simulation methods in this and other fields where systems of great complexity are encountered. 
Not only will simulation methods produce specific knowledge, but it is believed that they should also 
eventually yield enough information about given types of systems to make more general formulations possi- 
ble. For example, enough experience has not yet been gained about the present experimental system to 
understand what features are necessary under given conditions, but it is believed such information can 
be elicited by an extension of the present methods. 


As mentioned earlier, the gradual organization of the system to utilize the patterns $p_1$ and $p_2$
to change an output in opposite directions implies a primitive "recognition" of these patterns. It is
also found experimentally that after organization other patterns also have effects like $p_1$ or $p_2$. In
other words, patterns are classified together by such a transformation. It is to be hoped that, using a
more complex modifier, this type of behavior can also be organized and controlled, leading to systems
which effect classifications and generalizations. Success in this respect should make possible systems
which can organize themselves to perform in an environment presenting a rather wide variety of tasks.


Acknowledgment 
The authors wish to express their appreciation to F. A. Webster for many valuable discussions,
and also to those responsible for the operation of the Memory Test Computer for their very helpful
cooperation.




REFERENCES


Ashby, W. R., Design for a Brain, Wiley, 1952.

Brazier, M. A. B., The Electrical Activity of the Nervous System, Macmillan, 1951.

Hebb, D. O., The Organization of Behavior, Wiley, 1949.

McKay, D. M., (to be published) Oral communication at conference on Human Communication and Control,
M.I.T., June 1954.

Oettinger, A. G., "Programming a Digital Computer to Learn", Phil. Mag., 43, 1243-1263.

Shannon, C. E., "Presentation of Maze-solving Machine", Transactions of the Eighth Cybernetics
Conference of the Macy Foundation, 1952, 173-180.

Shannon, C. E., "Computers and Automata", Proc. I.R.E., 41, 1234-1241 (Oct. 1953).

Shimbel, A., "Contributions to the Mathematical Biophysics of the Central Nervous System, with Special
Reference to Learning", Bull. Math. Biophysics, 12, 241-274.

Stevens, S. S. (ed.), Handbook of Experimental Psychology, Wiley, 1951, Chap. 16.

Turing, A. M., "On Computable Numbers, with an Application to the Entscheidungsproblem", Proc. Lond.
Math. Soc., 24, 230-265 (1936).

Walter, W. G., The Living Brain, Norton, 1953.

Wilkes, M. V., "Can Machines Think", Proc. I.R.E., 41, 1230-1234 (Oct. 1953).


Fig. 1 - General transformation.


Fig. 2 - General self-organizing system.


Fig. 3 - Typical section of network showing weights, w, and thresholds, h, associated with nonlinear
elements i and j.




Fig. 4 - Simplified computer simulation flow diagram emphasizing modifier. The boxes of the diagram,
in order, read:

Start.
Generate new network from required characteristic distribution functions.
Set new time interval.
Carry out all computation and information processing required to simulate network for this time interval.
Examine absolute value of output:
    (just increased) decrease all contributive connection weights;
    (just decreased) increase all contributive connection weights.
Determine amplitude of output:
    (in null region) index and examine null region counter;
    (in pattern region) reset null region counter.
(T intervals) Set new output test displacement (alternate + and - displacements). Raise bias level to
    initial value. Reduce noise level to zero.
Index and examine sampling period counter.
(Time to sample) Determine existence of change in number of active cells in O(+) or O(-) since
    previous sample:
    (change) determine existence of change in output since last sample:
        (change) reduce noise level;
        (no change) increase noise level.
    (no change) reduce bias level.




Fig. 5 - Output record of an 8-element network; K = 0.75. (Plot annotations: "modifier activated";
time scale marker of 200 intervals.)


Fig. 6 - The connection weight matrix for the 8-element network sampled at points
labelled "a" through "d" in Fig. 5. (Panels (a) through (d); entries in octonary form.)




A STUDY OF ERGODICITY AND REDUNDANCY 
BASED ON INTERSYMBOL CORRELATION OF FINITE RANGE 


Satosi Watanabe 
United States Naval Postgraduate School 
Monterey, California 


Abstract 


Some of the basic concepts of information theory are critically reviewed in the light of a generalized 
formulation of the theory of Markoff's chains, in which the initial and final states are sequences of 
symbols of different lengths, and occurrence of symbols is governed by inter-symbol correlation proba- 
bility of finite range. In particular, the conditions of ergodicity and the structure of "ergodic sub- 
sets" of sequences of arbitrary length are carefully discussed. A mathematical method is developed to 
determine the "range" and "strength" of inter-symbol correlation. A brief summary of the content is 
given at the end of Section 1. 


Introduction 


The aim of this paper is to clarify some of the basic, but often carelessly used, concepts of information
theory, viz., the concepts of ergodicity, intersymbol correlation and redundancy. There are two
approaches to this problem-complex pertaining to probability. One is an empirical point of view, and 
probability here is understood in its statistical aspect. The other is an a priori point of view which 
deals with probability mainly in its predictive aspect. In the first standpoint, the entire population 
of messages in a language is supposed to be given, and the various probabilities are calculated by the 
actual frequencies of individual symbols or those of sequences of symbols. According to this method, a 
unique value of the probability of appearance of a given symbol or a given sequence can be statistically 
determined. In the second point of view, an ensemble of messages is supposed to be engendered by the 
given correlation probabilities starting from a given initial symbol or a given initial sequence of sym- 
bols. In this case, the existence of a unique, non-vanishing value of the probability of appearance of a 
given symbol or a given sequence is not guaranteed, for it may vanish with increasing length of messages, 
and it may depend on the initial condition. Thus, the problem of ergodicity acquires foremost importance 
in this approach. 


Our section 2, dealing with the problem of ergodicity, is therefore developed in the framework of the
second point of view. Once the nature of the ergodicity condition is clarified and this condition is
assumed to be fulfilled, a smooth passage from the second point of view to the first becomes easy.
Thus, our section 3 on redundancy can be interpreted in either point of view.


It is not implied by the foregoing paragraphs that the problem of ergodicity is irrelevant to the 
first standpoint or cannot be formulated in the framework of this standpoint. The situation is that the 
nucleus of the problem under consideration can be exhibited more directly and naturally in the second 
point of view. 


The usual theory of Markoff's chains, which is based on transition probabilities from one state to
another, is extended in this paper to the case where the probability $Q(a_1, \ldots, a_{\nu-1} \mid a_\nu)$ of symbol $a_\nu$
appearing in a message is dependent on the $(\nu - 1)$ immediately preceding symbols, $\nu$ being the range of
intersymbol correlation. A population of infinitely long messages is considered to be engendered solely
by this intersymbol correlation probability $Q(a_1, \ldots, a_{\nu-1} \mid a_\nu)$ from a given $(\nu - 1)$-symbol initial
sequence. The problem of ergodicity then pertains to the existence of a unique (i.e., independent of the
initial sequence), non-vanishing value of $P(a_1, \ldots, a_{\mu-1})$, which should give the probability that a
$(\mu - 1)$-symbol sequence arbitrarily taken from the population is $(a_1, \ldots, a_{\mu-1})$, $\mu$ being not necessarily
equal to $\nu$. This generalized problem of ergodicity is discussed in our section 2.


It is shown not only that finiteness of correlation range does not warrant ergodicity, as is often
erroneously assumed in existing literature, but also that if $\mu < \nu$ the quantity P can have more than
one finite value depending on the initial sequence, a situation which does not exist in the ordinary
Markoff chains.


Under the conditions that guarantee existence of a unique (whether or not non-vanishing) value of P,
a convenient quantity, called the correlation index $W_\mu$, defined by Eq. (31), is introduced, characterizing
both "range" and "strength" of correlation. First, it represents the "range", in the sense that the
actual correlation range is the maximum value of $\mu$ for which $W_\mu \neq 0$. This criterion is of both
theoretical and practical interest. Theoretically, it determines the applicability of the generalized
theory of Markoff's chains, and practically, it can be used to measure the existing correlation range
in a given population of messages.




Second, this quantity $W_\mu$ represents the "strength" of correlation, in the sense that $W_\mu$ quantitatively
measures the decrease of information due to the existence of $\mu$-symbol correlation as compared
with the $(\mu - 1)$-symbol correlation. Finally the so-called redundancy is expressed in the form of a
compact series in ascending range-numbers of the correlation indices, Eq. (42).


Ergodicity 


We assume the alphabet under consideration to consist of N symbols: $S_1, S_2, \ldots, S_N$. We
shall constantly use a mathematical symbol:

$$ Q(a_1, a_2, \ldots, a_m \mid a_{m+1}, \ldots, a_{n-1}, a_n), \tag{1} $$

where each one of $a_1, a_2, \ldots, a_n$ can be any one of the N symbols.


Definition I. The quantity denoted by (1) represents the probability that the last $(n - m)$ symbols of
a sequence of n symbols are $(a_{m+1}, \ldots, a_n)$ when it is known that the first m symbols of the
sequence are $(a_1, \ldots, a_m)$.


By the very nature of probability, we have

$$ Q(a_1, \ldots, a_m \mid a_{m+1}, \ldots, a_n) \ge 0, \qquad \sum_{a_{m+1}} \cdots \sum_{a_n} Q(a_1, \ldots, a_m \mid a_{m+1}, \ldots, a_n) = 1. \tag{2} $$


If there is no correlation between symbols, the probability of any place in a sequence being occupied
by symbol $S_i$ is independent of the preceding symbols. As a result, the only quantity which
determines a probability of the type (1) is $Q(S_i)$, which represents the probability of symbol $S_i$
appearing at any one place. In this case, we have:

$$ Q(a_1, \ldots, a_m \mid a_{m+1}, \ldots, a_n) = Q(a_{m+1})\, Q(a_{m+2}) \cdots Q(a_n). $$


If the correlation extends, for instance, over three consecutive symbols, and not more than three,
then the probability of a place in a sequence being occupied by symbol $S_k$ will depend on the two
symbols directly preceding it, but not on the symbols beyond these two. This means that the quantities
$Q(S_i, S_j \mid S_k)$ determine the general probability (1):

$$ Q(a_1, \ldots, a_m \mid a_{m+1}, \ldots, a_n) = Q(a_{m-1}, a_m \mid a_{m+1})\, Q(a_m, a_{m+1} \mid a_{m+2}) \cdots Q(a_{n-2}, a_{n-1} \mid a_n). $$


In general, we have the following theorem: 


Theorem I. If the intersymbol correlation does not extend over more than $\mu$ consecutive symbols in
a sequence, we can factorize (1) as follows:

$$ Q(a_1, \ldots, a_m \mid a_{m+1}, \ldots, a_n) = Q(a_{m-\mu+2}, \ldots, a_m \mid a_{m+1})\, Q(a_{m-\mu+3}, \ldots, a_{m+1} \mid a_{m+2}) \cdots Q(a_{n-\mu+1}, \ldots, a_{n-1} \mid a_n). \tag{3} $$

This theorem can be used to define the "range-number" of intersymbol correlation: this number
is the minimum allowable $\mu$ in the decomposition (3).
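
The factorization of Theorem I can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the table layout (a dictionary keyed by a context tuple and a next symbol), the function name `continuation_probability`, and the two-letter chain `binary_chain` are hypothetical and do not come from the paper. The alternating table reuses the two range-3 probabilities Q(S1, S2 | S1) = 1 and Q(S2, S1 | S2) = 1 of the illustration given later.

```python
from typing import Dict, Sequence, Tuple

Context = Tuple[str, ...]
CorrelationTable = Dict[Tuple[Context, str], float]


def continuation_probability(q: CorrelationTable,
                             known: Sequence[str],
                             continuation: Sequence[str],
                             mu: int) -> float:
    """Probability that `continuation` follows `known`, using decomposition (3).

    Each factor Q(context | next symbol) uses only the mu - 1 symbols that
    immediately precede the symbol being added. Missing table entries count as 0.
    """
    if mu < 1:
        raise ValueError("mu must be at least 1")
    if len(known) < mu - 1:
        raise ValueError("need at least mu - 1 known symbols")
    seq = list(known)
    prob = 1.0
    for symbol in continuation:
        context = tuple(seq[len(seq) - (mu - 1):])  # last mu - 1 symbols
        prob *= q.get((context, symbol), 0.0)
        seq.append(symbol)
    return prob


# Hypothetical range-2 chain on two letters (values chosen for illustration only).
binary_chain: CorrelationTable = {
    (("A",), "A"): 0.9, (("A",), "B"): 0.1,
    (("B",), "A"): 0.5, (("B",), "B"): 0.5,
}
p_chain = continuation_probability(binary_chain, ["B", "A"], ["A", "B"], mu=2)

# Range-3 table restricted to the alternating messages of the later illustration.
alternating: CorrelationTable = {
    (("S1", "S2"), "S1"): 1.0,
    (("S2", "S1"), "S2"): 1.0,
}
p_allowed = continuation_probability(alternating, ["S1", "S2"], ["S1", "S2", "S1"], mu=3)
p_forbidden = continuation_probability(alternating, ["S1", "S2"], ["S1", "S1"], mu=3)
```

Treating absent entries as zero keeps the table small when most continuations are impossible, as they are in the alternating case.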


Assuming the correlation to be of range $\nu$, we consider all the possible sequences whose first
$(\nu - 1)$ symbols are given to be, say, $(a_1, a_2, \ldots, a_{\nu-1})$. Among these sequences starting with
$(a_1, a_2, \ldots, a_{\nu-1})$, we inquire the probability of those sequences whose first $\nu$ symbols are
$(a_1, b_1, b_2, \ldots, b_{\nu-1})$. This probability is obviously given by

$$ R(a_1, a_2, \ldots, a_{\nu-1} \mid b_1, \ldots, b_{\nu-1}) = Q(a_1, a_2, \ldots, a_{\nu-1} \mid b_{\nu-1}) \quad \text{if } (a_2, \ldots, a_{\nu-1}) = (b_1, \ldots, b_{\nu-2}), $$

and otherwise $R(a_1, a_2, \ldots, a_{\nu-1} \mid b_1, \ldots, b_{\nu-1}) = 0$.


In other words, the probability in question can be written in a matrix form:

$$ (a_1, a_2, \ldots, a_{\nu-1} \,|\, R \,|\, b_1, b_2, \ldots, b_{\nu-1}) = Q(a_1, \ldots, a_{\nu-1} \mid b_{\nu-1})\, \delta(a_2, b_1)\, \delta(a_3, b_2) \cdots \delta(a_{\nu-1}, b_{\nu-2}), \tag{4} $$

with $\delta(S_i, S_j) = 0$ if $i \neq j$ and $\delta(S_i, S_j) = 1$ if $i = j$.

Using this matrix expression, the probability, in the above population of sequences, of a particular
sequence $(b_1, b_2, \ldots, b_{\nu-1})$ appearing in such a position that the place distance between $a_1$ and $b_1$
is m symbols can be given by

$$ T^{(m)}(a_1, \ldots, a_{\nu-1} \mid b_1, \ldots, b_{\nu-1}) = (a_1, \ldots, a_{\nu-1} \,|\, R^m \,|\, b_1, \ldots, b_{\nu-1}), \tag{5} $$

where $R^m$ simply means the m-th power of R in the sense of matrix multiplication.
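
To make the matrix form concrete, the following sketch builds R from a correlation table and raises it to a power, as in (4) and (5). The helper names (`build_r`, `mat_mul`, `mat_pow`) and the numerical table `q_example` are assumptions introduced for illustration; the paper gives no numerical matrix at this point.

```python
from itertools import product


def build_r(q, alphabet, nu):
    """Matrix (4): rows and columns are all (nu - 1)-symbol sequences.

    The delta factors of (4) are enforced by only filling the column
    b = (a_2, ..., a_{nu-1}, new symbol) for each row a.
    """
    if nu < 2:
        raise ValueError("this sketch assumes nu >= 2")
    states = list(product(alphabet, repeat=nu - 1))
    index = {state: i for i, state in enumerate(states)}
    r = [[0.0] * len(states) for _ in states]
    for a in states:
        for symbol in alphabet:
            b = a[1:] + (symbol,)
            r[index[a]][index[b]] = q.get((a, symbol), 0.0)
    return states, r


def mat_mul(x, y):
    n = len(x)
    return [[sum(x[i][k] * y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_pow(r, m):
    """m-th matrix power by repeated multiplication (m = 0 gives the identity)."""
    n = len(r)
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for _ in range(m):
        result = mat_mul(result, r)
    return result


# Hypothetical range-3 correlation on the alphabet {A, B}.
q_example = {
    (("A", "A"), "A"): 0.5, (("A", "A"), "B"): 0.5,
    (("A", "B"), "A"): 1.0,
    (("B", "A"), "A"): 0.25, (("B", "A"), "B"): 0.75,
    (("B", "B"), "A"): 1.0,
}
states, r = build_r(q_example, "AB", nu=3)
t2 = mat_pow(r, 2)                       # T^(2) of equation (5)
aa, ba = states.index(("A", "A")), states.index(("B", "A"))
t2_aa_to_ba = t2[aa][ba]                 # start (A, A), find (B, A) two places later
```

Each row of R and of its powers sums to one, which is the normalization (2) seen at the level of (nu - 1)-symbol states.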


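With the help of the quantity (5), we can further calculate the probability of a given sequence
of any length $(\mu - 1)$, say $(b_1, \ldots, b_{\mu-1})$, appearing at any position after the initial
sequence $(a_1, \ldots, a_{\nu-1})$. If $\mu > \nu$, this probability will be

$$ T^{(m)}(a_1, \ldots, a_{\nu-1} \mid b_1, \ldots, b_{\mu-1}) = T^{(m)}(a_1, \ldots, a_{\nu-1} \mid b_1, \ldots, b_{\nu-1})\, Q(b_1, \ldots, b_{\nu-1} \mid b_\nu)\, Q(b_2, \ldots, b_\nu \mid b_{\nu+1}) \cdots Q(b_{\mu-\nu}, \ldots, b_{\mu-2} \mid b_{\mu-1}), \tag{6} $$

where m stands for the symbol distance between $a_1$ and $b_1$.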

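If $\mu < \nu$, we have

$$ T^{(m)}(a_1, \ldots, a_{\nu-1} \mid b_1, \ldots, b_{\mu-1}) = \sum_{b_\mu} \cdots \sum_{b_{\nu-1}} T^{(m)}(a_1, \ldots, a_{\nu-1} \mid b_1, \ldots, b_{\nu-1}), \tag{7} $$

where m bears the same meaning.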


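Now, the average probability of sequence $(b_1, \ldots, b_{\mu-1})$ with the "place-distance" not larger than
m will be

$$ U^{(m)}(a_1, \ldots, a_{\nu-1} \mid b_1, \ldots, b_{\mu-1}) = \frac{1}{m} \sum_{k=1}^{m} T^{(k)}(a_1, \ldots, a_{\nu-1} \mid b_1, \ldots, b_{\mu-1}). \tag{8} $$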


We now proceed to define what we mean by ergodicity in this paper. We consider all the possible
infinitely long sequences which start with a given initial sequence $(a_1, \ldots, a_{\nu-1})$ and ask the average
probability of the sequence $(b_1, \ldots, b_{\mu-1})$ appearing in any position. This probability evidently has
the mathematical expression:

$$ \lim_{m \to \infty} U^{(m)}(a_1, \ldots, a_{\nu-1} \mid b_1, \ldots, b_{\mu-1}). \tag{9} $$


The word average here implies a two-fold averaging, viz., first, averaging over all the possible
sequences with a fixed position where the sequence $(b_1, \ldots, b_{\mu-1})$ should appear, and second, averaging
over all the possible positions of this sequence. The first averaging is mathematically represented by
the matrix multiplication in (5), and the second averaging by the summation in (8).


Definition II. If $\lim_{m \to \infty} U^{(m)}(a_1, \ldots, a_{\nu-1} \mid b_1, \ldots, b_{\mu-1})$ converges to a unique, non-vanishing limit
independent of $(a_1, \ldots, a_{\nu-1})$, where $(a_1, \ldots, a_{\nu-1})$ can be taken arbitrarily from a certain
family of $(\nu - 1)$-symbol sequences and $(b_1, \ldots, b_{\mu-1})$ can be taken arbitrarily from a certain
family of $(\mu - 1)$-symbol sequences, then we speak of ergodicity with regard to these families.


We shall presently see that the quantity (9) with a fixed initial sequence $(a_1, \ldots, a_{\nu-1})$ and a
fixed final sequence $(b_1, \ldots, b_{\mu-1})$ indeed converges to a limit, say:

$$ U^{(\infty)}(a_1, \ldots, a_{\nu-1} \mid b_1, \ldots, b_{\mu-1}), \tag{10} $$

but this limit is not necessarily larger than zero, nor is it in general necessarily independent of the
initial sequence. In order to understand the situation clearly, let us invoke some well-known mathematical
theorems regarding the Markoff chains.


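The ordinary Markoff chain formally pertains to a two-symbol correlation probability $(\alpha \,|\, R \,|\, \beta)$ with

$$ (\alpha \,|\, R \,|\, \beta) \ge 0, \qquad \sum_{\beta} (\alpha \,|\, R \,|\, \beta) = 1, \qquad \alpha, \beta = 1, 2, \ldots, M. \tag{11} $$

In accordance with the usual rule of matrix multiplication, we further introduce

$$ (\alpha \,|\, R^m \,|\, \beta) = \sum_{\gamma} (\alpha \,|\, R^{m-1} \,|\, \gamma)\, (\gamma \,|\, R \,|\, \beta). \tag{12} $$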
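Then, we have the following theorems:


Theorem II. The quantity defined by

$$ U^{(m)}(\alpha \mid \beta) = \frac{1}{m} \sum_{k=1}^{m} (\alpha \,|\, R^k \,|\, \beta) \tag{13} $$

for any given pair $(\alpha, \beta)$ converges to a limit as $m \to \infty$:

$$ \lim_{m \to \infty} U^{(m)}(\alpha \mid \beta) = U^{(\infty)}(\alpha \mid \beta). \tag{14} $$

Theorem III. The entire set G of symbols $(\alpha = 1, 2, \ldots, M)$ can be divided into a "vanishing" subset
V and a certain number of "closed" subsets $C_i$ $(i = 1, 2, \ldots)$ in such a way that

$U^{(\infty)}(\alpha \mid \beta) = 0$ for $\alpha$ belonging to G, and for $\beta$ belonging to V,
$U^{(\infty)}(\alpha \mid \beta) > 0$ for $\alpha$ and $\beta$ belonging to the same $C_i$,
$U^{(\infty)}(\alpha \mid \beta) = 0$ for $\alpha$ and $\beta$ belonging to different C's.


Theorem IV. $U^{(\infty)}(\alpha \mid \beta)$ is independent of $\alpha$, if $\alpha$ and $\beta$ belong to the same C.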
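
Theorems II to IV can be checked numerically on a small chain. The sketch below is an illustration only: the four-state matrix `chain` is hypothetical, and the closed subsets are found by a plain reachability test (a state lies in a closed subset when every state it can reach can reach it back), which is a standard way of computing them rather than the paper's own procedure. The average follows (13), with powers running from 1 to m.

```python
def mat_mul(x, y):
    n = len(x)
    return [[sum(x[i][k] * y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def reachable(r, start):
    """States reachable from `start` through transitions of nonzero probability."""
    seen, stack = {start}, [start]
    while stack:
        i = stack.pop()
        for j, p in enumerate(r[i]):
            if p > 0 and j not in seen:
                seen.add(j)
                stack.append(j)
    return seen


def split_states(r):
    """Return (closed subsets, vanishing subset) in the sense of Theorem III."""
    n = len(r)
    reach = [reachable(r, i) for i in range(n)]
    closed, vanishing = [], []
    for i in range(n):
        if all(i in reach[j] for j in reach[i]):
            members = sorted(reach[i])
            if members not in closed:
                closed.append(members)
        else:
            vanishing.append(i)
    return closed, vanishing


def cesaro_average(r, m):
    """U^(m) of (13): the mean of R^1, ..., R^m."""
    n = len(r)
    power = [row[:] for row in r]
    total = [[0.0] * n for _ in range(n)]
    for _ in range(m):
        for i in range(n):
            for j in range(n):
                total[i][j] += power[i][j]
        power = mat_mul(power, r)
    return [[value / m for value in row] for row in total]


# Hypothetical chain: state 0 is absorbing, states 1 and 2 alternate,
# state 3 leaves at once for state 0 or state 1 with equal probability.
chain = [
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.5, 0.5, 0.0, 0.0],
]
closed_subsets, vanishing_subset = split_states(chain)
u = cesaro_average(chain, 1000)
```

In this example the powers of the matrix do not converge on the alternating pair, but their average does, which is why the limit is taken on U rather than on T. The row of the vanishing state 3 shows a limit that depends on the detailed transition probabilities out of V.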


Coming back to our original topic, if the correlation range is two, and if $\mu = \nu$, these theorems
can be directly applied to our problem involved in Definition II. If the correlation range is > 2, we
only need to consider a sequence of $(\nu - 1)$ symbols collectively as a symbol $\alpha$. The R's defined in
(4) indeed satisfy (11). The cases $\mu \neq \nu$ can be handled with the help of (6) and (7).


From Theorem II follows quite generally: 
Theorem V. The limit (10) exists. 


We shall now discuss first the case $\mu = \nu$ in the light of Theorems II, III and IV. According to
Theorem III, the entire set of $(\nu - 1)$-symbol sequences is subdivided into a vanishing subset V and
a certain number of closed subsets $C_i$. If the final sequence of (10) belongs to V, then $U^{(\infty)}$ is
zero independently of the initial sequence. For a given final sequence belonging to one of the closed
subsets, $U^{(\infty)}$ will be zero if the initial sequence belongs to another closed subset, and will have a
constant non-vanishing value insofar as the initial sequence belongs to the same closed subset as the
final sequence. Thus:


Theorem VI. When $\mu = \nu$, ergodicity in the sense of Def. II holds if and only if the initial
family and the final family are the same closed subset.


In the cases where $\mu > \nu$, we construct an "extended" closed subset $D_i$ of $(\mu - 1)$ symbols by
taking those $(\mu - 1)$-symbol sequences $(b_1, \ldots, b_{\mu-1})$ whose first $(\nu - 1)$ symbols coincide with one
of the members of the $(\nu - 1)$-symbol closed subset $C_i$ and which satisfy the condition:

$$ Q(b_1, \ldots, b_{\nu-1} \mid b_\nu)\, Q(b_2, \ldots, b_\nu \mid b_{\nu+1}) \cdots Q(b_{\mu-\nu}, \ldots, b_{\mu-2} \mid b_{\mu-1}) \neq 0. \tag{15} $$

The extended vanishing subset will be composed of all those $(\mu - 1)$-symbol sequences whose first
$(\nu - 1)$ symbols coincide with one of the members of the $(\nu - 1)$-symbol vanishing subset, or whose
first $(\nu - 1)$ symbols coincide with one of the members of some closed subset but whose last $(\mu - \nu)$
symbols violate the condition (15). The entire set of possible $(\mu - 1)$-symbol sequences is thus
covered by the D's and the extended V, and there is no possible overlapping. If the $(\mu - 1)$-symbol final
sequence of (10) is a member of this extended vanishing subset, $U^{(\infty)}$ will certainly vanish whatever the
initial sequence may be. If the final sequence belongs to an extended closed subset $D_i$, then $U^{(\infty)}$
will vanish for an initial sequence belonging to a $C_j$ different from the one, $C_i$, which corresponds
to $D_i$, and will have a constant non-vanishing value for any initial sequence belonging to $C_i$.


Theorem VII. When $\nu < \mu$, ergodicity holds if and only if the initial family is one of the closed
subsets $C_i$ and the final family is the extended closed subset $D_i$ corresponding to $C_i$.


In the cases where $\mu < \nu$, we encounter a rather peculiar situation. From a closed subset $C_i$ we
construct a retrenched subset $E_i$ of $(\mu - 1)$-symbol sequences. $E_i$ is the set of those $(\mu - 1)$-symbol
sequences which coincide with the first $(\mu - 1)$ symbols of at least one of the members of $C_i$. The
retrenched vanishing subset is defined as the totality of all those $(\mu - 1)$-symbol sequences which do
not belong to any one of the retrenched closed subsets. In the case of the extended closed subsets, a given
sequence of $(\mu - 1)$ symbols could not belong to more than one $D_i$, since the division made in Theorem
III does not allow for any overlapping. However, in the present case of retrenched subsets, a given
$(\mu - 1)$-symbol sequence may well belong to more than one E. If the $(\mu - 1)$-symbol final sequence
of (10) belongs to the retrenched vanishing subset, $U^{(\infty)}$ will always vanish. If the $(\mu - 1)$-symbol
final sequence belongs to $E_i, E_j, \ldots, E_k$, then $U^{(\infty)}$ will be zero for an initial sequence belonging to
a C different from any one of the corresponding subsets $C_i, C_j, \ldots, C_k$. For the same final sequence,
$U^{(\infty)}$ may thus have different non-vanishing values according to which one of $C_i, C_j, \ldots, C_k$ the
initial sequence belongs.


Theorem VIII. When $\mu < \nu$, ergodicity holds for the initial family identical with one of the
closed subsets $C_i$ and the final family identical with the corresponding retrenched subset $E_i$.


In the foregoing considerations, we have systematically omitted the initial sequences belonging to
the vanishing subset V. The reason for this is that $U^{(\infty)}$ depends in this case on the detailed
structure of the intersymbol correlation, and that we cannot draw a conclusion of general validity. (Of
course, if the final sequence also belongs to V, then $U^{(\infty)}$ vanishes.)


Regarding the closed subsets of $(\nu - 1)$ symbols, we should like to mention the following interesting
property. We have obviously

$$ U^{(\infty)}(a_1, \ldots, a_{\nu-1} \mid b_2, \ldots, b_\nu) = \sum_{b_1} U^{(\infty)}(a_1, \ldots, a_{\nu-1} \mid b_1, \ldots, b_{\nu-1})\, Q(b_1, \ldots, b_{\nu-1} \mid b_\nu), \tag{16} $$

whence we infer:


Theorem IX. $(b_2, b_3, \ldots, b_\nu)$ is a member of $C_i$, if there is any symbol $b_1$ such that
$(b_1, b_2, \ldots, b_{\nu-1})$ is a member of $C_i$ and $Q(b_1, b_2, \ldots, b_{\nu-1} \mid b_\nu) \neq 0$.


For a given $(b_1, b_2, \ldots, b_{\nu-1})$ there must be at least one $b_\nu$ such that $Q(b_1, b_2, \ldots, b_{\nu-1} \mid b_\nu) \neq 0$,
on account of (2). Hence:


Theorem X. If $(b_1, b_2, \ldots, b_{\nu-1})$ is a member of $C_i$, then there is always a member of $C_i$ whose
first $(\nu - 2)$ symbols are $(b_2, \ldots, b_{\nu-1})$.


Before closing this section, a simple illustration may be given. Suppose the alphabet to be composed
of three symbols, $S_1$, $S_2$ and $S_3$, and to have an intersymbol correlation of range 3:

$Q(S_1, S_1 \mid S_1) = 1$,    $Q(S_1, S_2 \mid S_1) = 1$,
$Q(S_1, S_3 \mid S_1) = 1$,    $Q(S_2, S_1 \mid S_2) = 1$,
$Q(S_2, S_2 \mid S_2) = 1$,    $Q(S_2, S_3 \mid S_1) = 1$,
$Q(S_3, S_1 \mid S_1) = 1$,    $Q(S_3, S_2 \mid S_3) = 1$,
$Q(S_3, S_3 \mid S_1) = 1$.


Then the $(\nu - 1)$-symbol subsets are:

$C_1$ : $(S_1, S_1)$
$C_2$ : $(S_1, S_2)$, $(S_2, S_1)$
$C_3$ : $(S_2, S_2)$
$V$ : $(S_1, S_3)$, $(S_3, S_1)$, $(S_2, S_3)$, $(S_3, S_2)$, $(S_3, S_3)$


The extended 3-symbol subsets are:

$D_1$ : $(S_1, S_1, S_1)$
$D_2$ : $(S_1, S_2, S_1)$, $(S_2, S_1, S_2)$
$D_3$ : $(S_2, S_2, S_2)$
$V'$ : all other 3-symbol sequences


The retrenched 1-symbol subsets are:

$E_1$ : $S_1$
$E_2$ : $S_1$, $S_2$
$E_3$ : $S_2$
$V''$ : $S_3$


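We can see the overlapping we have discussed; as a result, $U^{(\infty)}$ with the final sequence (symbol) $S_1$,
for instance, becomes three-valued:

$U^{(\infty)}(S_1, S_1 \mid S_1) = 1$
$U^{(\infty)}(S_1, S_2 \mid S_1) = \tfrac{1}{2}$
$U^{(\infty)}(S_2, S_1 \mid S_1) = \tfrac{1}{2}$
$U^{(\infty)}(S_2, S_2 \mid S_1) = 0$
All other $U^{(\infty)}(\;\cdot \mid S_1) = 1$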
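
The extended and retrenched subsets of this illustration can be reproduced mechanically. The sketch below assumes the closed subsets C1, C2, C3 as listed in the text and uses only the four correlation probabilities that act on them (each equal to 1); the function names `retrench` and `extend` are hypothetical labels for the two constructions described before Theorems VII and VIII.

```python
# Closed (nu - 1)-symbol subsets of the illustration, nu = 3.
closed = {
    "C1": [("S1", "S1")],
    "C2": [("S1", "S2"), ("S2", "S1")],
    "C3": [("S2", "S2")],
}

# Correlation probabilities acting on the closed subsets (all equal to 1).
q = {
    (("S1", "S1"), "S1"): 1.0,
    (("S1", "S2"), "S1"): 1.0,
    (("S2", "S1"), "S2"): 1.0,
    (("S2", "S2"), "S2"): 1.0,
}
ALPHABET = ("S1", "S2", "S3")
NU = 3


def retrench(members, length):
    """Retrenched subset: the first `length` symbols of each member."""
    return sorted({member[:length] for member in members})


def extend(members, length):
    """Extended subset: grow each member symbol by symbol, keeping only
    continuations whose correlation probability is nonzero (condition (15))."""
    frontier = [tuple(member) for member in members]
    while frontier and len(frontier[0]) < length:
        grown = []
        for seq in frontier:
            for symbol in ALPHABET:
                if q.get((seq[-(NU - 1):], symbol), 0.0) > 0:
                    grown.append(seq + (symbol,))
        frontier = grown
    return sorted(frontier)


retrenched = {name.replace("C", "E"): retrench(members, 1)
              for name, members in closed.items()}
extended = {name.replace("C", "D"): extend(members, 3)
            for name, members in closed.items()}

# One-symbol sequences that belong to more than one retrenched subset.
overlapping = sorted(
    seq for seq in {s for subset in retrenched.values() for s in subset}
    if sum(seq in subset for subset in retrenched.values()) > 1
)
```

The overlap list shows why the retrenched case differs from the extended one: both S1 and S2 belong to two retrenched subsets, so the limit for such a final symbol depends on which closed subset the initial sequence came from.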




Redundancy 


In this section, we shall constantly use a quantity denoted by:

$$ P(a_1, a_2, \ldots, a_n) \ge 0. \tag{17} $$

Definition III. The quantity (17) represents the probability, in infinitely long messages, of an
arbitrarily taken sequence of symbol-length n being a particular sequence $(a_1, a_2, \ldots, a_n)$.


From this definition follows the normalization condition:

$$ \sum_{a_1} \cdots \sum_{a_n} P(a_1, a_2, \ldots, a_n) = 1. \tag{18} $$

According to the point of view of the last section, the existence of a unique value of such a
probability is not unconditionally guaranteed. Only if the initial sequence $(b_1, \ldots, b_{\nu-1})$ is limited to
within a closed subset, say, $C_i$, then

$$ U^{(\infty)}(b_1, \ldots, b_{\nu-1} \mid a_1, \ldots, a_n) $$

becomes independent of $(b_1, \ldots, b_{\nu-1})$, i.e., a function only of $(a_1, \ldots, a_n)$. If this is the case, we
can write

$$ U^{(\infty)}(b_1, \ldots, b_{\nu-1} \mid a_1, \ldots, a_n) = P(a_1, \ldots, a_n). \tag{19} $$


According to the theorems of the last section, if $(a_1, \ldots, a_n)$ belongs to $C_i$, or its extended subset
$D_i$, or its retrenched subset $E_i$, P will be finite, and otherwise zero. We have therefore to restrict
the "infinitely long messages" of Definition III to only those which start with initial sequences belonging
to one closed subset. The condition regarding P does not require that all the P's should be
non-vanishing; hence the restriction on the final sequences, in the sense of Definition II, is not necessary.
On account of ergodicity, two sequences starting from two different initial sequences of the same closed
subset become, in the long run, statistically identical. It is true that we can evade the restriction
on the initial sequences by giving a certain "weight" to each of the closed subsets, which would lead to
a unique value of each P. However, from the point of view that the messages are engendered solely by
the correlation probability, this alternative is not acceptable, since it involves an arbitrary "weight"
of each closed subset. Our discussion in this section will be based on the assumption that the initial
sequences are limited to a single subset. The generalization of the results to the case of "weighted"
subsets is very simple.


It should be noted that, as a result of the limitation of the initial sequences to a single subset,
it may well happen that some of the generally possible sequences $(a_1, \ldots, a_{\nu-1})$ in the correlation
probability $Q(a_1, \ldots, a_{\nu-1} \mid a_\nu)$ actually never happen in the possible messages. Thus the actual range of
correlation may become smaller than the range defined with regard to the entire possibilities of the a's.
For instance, in the illustration of the last section, if we limit ourselves to the initial subset $C_2$,
all 3-symbol Q's except $Q(S_1, S_2 \mid S_1) = 1$ and $Q(S_2, S_1 \mid S_2) = 1$ will become meaningless. These two
3-symbol correlation probabilities reduce to the following two 2-symbol correlation probabilities:
$Q(S_1 \mid S_2) = 1$ and $Q(S_2 \mid S_1) = 1$. The range is thus reduced from three to two.


In the empirical point of view, if a population of very long sample messages is given, we can
always evaluate (17) by just counting the frequency of each segment $(a_1, \ldots, a_n)$. However, if we divide
this entire population into, say, two groups, the values of (17) may be different in the two groups.
This discrepancy may be caused by a difference in correlation probabilities and/or by a difference in
the initial sequences. We thus see that the problem of ergodicity is not irrelevant to the empirical
point of view. In this section, however, we assume that we have a single population from which the
quantities of the type (17) are uniquely determined.


The quantity (17) has, besides (18), the property:

$$ \sum_{a_1} \cdots \sum_{a_k} \; \sum_{a_{k+m+1}} \cdots \sum_{a_n} P(a_1, \ldots, a_k, b_1, \ldots, b_m, a_{k+m+1}, \ldots, a_n) = P(b_1, \ldots, b_m). \tag{20} $$

This is obvious from the statistical point of view, but can also be verified from the standpoint of (19).


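According to (6), we have for $n \ge \nu$

$$ P(a_1, \ldots, a_n) = P(a_1, \ldots, a_{\nu-1})\, Q(a_1, \ldots, a_{\nu-1} \mid a_\nu) \cdots Q(a_{n-\nu+1}, \ldots, a_{n-1} \mid a_n), \tag{21} $$

or more generally,

$$ P(a_1, \ldots, a_n) = P(a_1, \ldots, a_{\mu-1})\, Q(a_1, \ldots, a_{\mu-1} \mid a_\mu) \cdots Q(a_{n-\mu+1}, \ldots, a_{n-1} \mid a_n), \tag{22} $$

provided $n \ge \mu \ge \nu$. Equivalence of (21) and (22) can readily be seen with the help of (3) and (6). In
particular, for $n = \mu \ge \nu$, we get from (22)

$$ Q(a_1, \ldots, a_{\mu-1} \mid a_\mu) = \frac{P(a_1, \ldots, a_\mu)}{P(a_1, \ldots, a_{\mu-1})}. \tag{23} $$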


This is just what should be according to Definitions I and III. (23) may be considered as the
definition of $Q(a_1, \ldots, a_{\mu-1} \mid a_\mu)$ even for $\mu < \nu$. However, with such Q's with $\mu < \nu$, (22) will not be true,
since the Q's with $\mu < \nu$ cannot describe fully the existing correlation.


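Substituting (23) into (22), we get

$$ P(a_1, \ldots, a_n) = \frac{P(a_1, \ldots, a_\mu)\, P(a_2, \ldots, a_{\mu+1}) \cdots P(a_{n-\mu+1}, \ldots, a_n)}{P(a_2, \ldots, a_\mu)\, P(a_3, \ldots, a_{\mu+1}) \cdots P(a_{n-\mu+1}, \ldots, a_{n-1})}, \tag{24} $$

provided $n \ge \mu \ge \nu$. The actual range $\nu$ is thus the minimum value of $\mu$ for which the decomposition (24)
is allowed.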


For an allowed value of $\mu$, if a further decomposition of range $\mu - 1$ is still allowed, i.e., if
$\mu - 1 \ge \nu$, then we get from (24)

$$ P(a_1, \ldots, a_\mu) = \frac{P(a_1, \ldots, a_{\mu-1})\, P(a_2, \ldots, a_\mu)}{P(a_2, \ldots, a_{\mu-1})} \tag{25} $$

for all $(a_1, \ldots, a_\mu)$. But if $\mu - 1 < \nu$, the left side of (25) will not be equal to its right side for at
least one sequence $(a_1, \ldots, a_\mu)$. Thus we are led to use (25) as a criterion to determine whether $\mu > \nu$
or not: if (25) holds for all $(a_1, \ldots, a_\mu)$, then $\mu > \nu$; if not, $\mu \le \nu$. Indeed, if (25) is possible,
we have in virtue of (23),

$$ Q(a_1, \ldots, a_{\mu-1} \mid a_\mu) = \frac{P(a_1, \ldots, a_\mu)}{P(a_1, \ldots, a_{\mu-1})} = \frac{P(a_2, \ldots, a_\mu)}{P(a_2, \ldots, a_{\mu-1})} = Q(a_2, \ldots, a_{\mu-1} \mid a_\mu), \tag{26} $$

i.e., Q of range $\mu$ is reducible to a Q of range $(\mu - 1)$. In the light of Theorem I, this means that
the actual range is $(\mu - 1)$ or less. If (25) breaks down for at least one sequence $(a_1, \ldots, a_\mu)$, then
(26) does not hold in general, meaning that the actual range is larger than $(\mu - 1)$.


Theorem XI. If and only if (25) holds for all $(a_1, \ldots, a_\mu)$, the actual correlation range $\nu$ is
$(\mu - 1)$ or less.


This criterion is interesting particularly in the empirical point of view, for here the P's,
instead of the Q's, are the quantities which are primarily given. The criterion of Theorem XI can be
brought to a more concise form with the help of the well-known theorem attributed to W. Gibbs:


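Theorem XII. If

$$ f_i \ge 0, \qquad g_i \ge 0, \qquad \text{and} \qquad \sum_i f_i = \sum_i g_i = 1, \tag{27} $$

then

$$ W \equiv \sum_i f_i \log f_i - \sum_i f_i \log g_i \ge 0, \tag{28} $$

where the equality holds only when $f_i = g_i$ for all i.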
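
A quick numerical check of the inequality (28) may help fix ideas. The function name `gibbs_gap` and the sample distributions are illustrative assumptions; terms with $f_i = 0$ are taken as contributing zero, and the gap is reported as infinite when some $g_i$ vanishes where $f_i$ does not.

```python
import math


def gibbs_gap(f, g):
    """W of (28): sum f_i log f_i - sum f_i log g_i, in natural-log units."""
    if len(f) != len(g):
        raise ValueError("f and g must have the same length")
    gap = 0.0
    for fi, gi in zip(f, g):
        if fi == 0:
            continue                 # the limit of f log f as f -> 0 is 0
        if gi == 0:
            return math.inf          # f puts weight where g puts none
        gap += fi * math.log(fi / gi)
    return gap


same = gibbs_gap([0.5, 0.25, 0.25], [0.5, 0.25, 0.25])
different = gibbs_gap([0.9, 0.1], [0.5, 0.5])
```

With equal distributions the gap is exactly zero, and with unequal ones it is strictly positive, in line with the equality clause of the theorem.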
Now, let us call the left-hand side and the right-hand side of (25), respectively,

$$ f(a_1, \ldots, a_\mu) = P(a_1, \ldots, a_\mu), \tag{29} $$

$$ g(a_1, \ldots, a_\mu) = \frac{P(a_1, \ldots, a_{\mu-1})\, P(a_2, \ldots, a_\mu)}{P(a_2, \ldots, a_{\mu-1})}, \tag{30} $$

and consider the index i of Theorem XII as a collective index for various possible sequences of
symbol-length $\mu$. On account of (18) and (20), the conditions (27) are satisfied, and we obtain

$$ W_\mu \equiv \sum P(a_1, \ldots, a_\mu) \log P(a_1, \ldots, a_\mu) - 2 \sum P(a_1, \ldots, a_{\mu-1}) \log P(a_1, \ldots, a_{\mu-1}) + \sum P(a_1, \ldots, a_{\mu-2}) \log P(a_1, \ldots, a_{\mu-2}) \ge 0. \tag{31} $$

Only when (25) holds for all $(a_1, \ldots, a_\mu)$ is $W_\mu = 0$. In other words, for a given value of $\nu$,
$W_\mu = 0$ for $\mu > \nu$. This leads to a convenient way to determine the actual range:


Theorem XIII. The actual range $\nu$ is the maximum value of $\mu$ for which $W_\mu \neq 0$.
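
The criterion of Theorem XIII lends itself to direct computation from block frequencies. The sketch below is an illustration under assumptions: the message is represented by one period of a periodic sequence read cyclically, so that block frequencies are exact, and the index is computed as the second difference of the sums of P log P over blocks of lengths $\mu$, $\mu - 1$ and $\mu - 2$. The function names and the test sequences "AB" and "AAB" are hypothetical.

```python
import math
from collections import Counter


def block_probabilities(cycle, length):
    """Relative frequencies of `length`-symbol blocks in a cyclic message."""
    n = len(cycle)
    counts = Counter(
        tuple(cycle[(start + k) % n] for k in range(length))
        for start in range(n)
    )
    return {block: count / n for block, count in counts.items()}


def plogp_sum(cycle, length):
    """Sum of P log P over all blocks of the given length (0 for length <= 0)."""
    if length <= 0:
        return 0.0
    return sum(p * math.log(p)
               for p in block_probabilities(cycle, length).values())


def correlation_index(cycle, mu):
    """W_mu as a second difference of the block sums of P log P."""
    return (plogp_sum(cycle, mu)
            - 2.0 * plogp_sum(cycle, mu - 1)
            + plogp_sum(cycle, mu - 2))


def actual_range(cycle, max_mu, tol=1e-9):
    """Largest mu <= max_mu with W_mu clearly nonzero (1 if there is none)."""
    nonzero = [mu for mu in range(2, max_mu + 1)
               if correlation_index(cycle, mu) > tol]
    return max(nonzero) if nonzero else 1


w2_alternating = correlation_index("AB", 2)   # two-symbol alternation
range_alternating = actual_range("AB", 6)
range_aab = actual_range("AAB", 6)            # next symbol needs two predecessors
```

In the sequence "AAB" a single preceding A does not decide the next symbol but two preceding symbols do, so the computed range is three. With frequencies counted from finite samples rather than an exact cycle, the tolerance `tol` would have to be chosen with the sampling error in mind.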




The W's defined by (31) will be called "correlation indices".
For $\mu = 2$, the definition of $W_\mu$ in (31) should be understood as meaning

$$ W_2 \equiv \sum P(a_1, a_2) \log P(a_1, a_2) - 2 \sum P(a_1) \log P(a_1), \tag{32} $$

for we have here $g(a_1, a_2) = P(a_1) P(a_2)$.


We shall now proceed to find out the average amount of information carried by a message-segment of
length n in a language in which the P's exist. A specific message-segment $(a_1, \ldots, a_n)$ has
probability $P(a_1, \ldots, a_n)$. Thus the information per symbol carried by this message-segment is
$-\frac{1}{n} \log P(a_1, \ldots, a_n)$.
The probability of occurrence of such a message being $P(a_1, \ldots, a_n)$, the average information per symbol
for various possible message-segments of length n is given by

$$ I_n = -\frac{1}{n} \sum P(a_1, \ldots, a_n) \log P(a_1, \ldots, a_n). \tag{33} $$
Now, if the existing correlation is of range $\nu$, the P can be decomposed as in (24) with $\mu = \nu$. A
straightforward calculation with the help of (18) and (20) gives

$$ I_n = I_{n,\nu} = -\frac{n - \nu + 1}{n} \sum P(a_1, \ldots, a_\nu) \log P(a_1, \ldots, a_\nu) + \frac{n - \nu}{n} \sum P(a_1, \ldots, a_{\nu-1}) \log P(a_1, \ldots, a_{\nu-1}). \tag{34} $$

For an obvious reason this $\nu$ can be the actual minimum range or any $\nu$ that is larger than this.
Supposing $\nu$ in (34) to be the actual minimum range, let us find the error which would be committed by the
calculation based on the assumption that the actual range were $\nu - 1$. This is easily found to be

$$ I_{n,\nu} - I_{n,\nu-1} = -\frac{n - \nu + 1}{n} W_\nu. \tag{35} $$

Repeating this process, we obtain

$$ I_{n,\nu} = I^0 - \sum_{\mu=2}^{\nu} \frac{n - \mu + 1}{n} W_\mu, \tag{36} $$

with

$$ I^0 = I_{n,1} = -\sum_{a} P(a) \log P(a). \tag{37} $$
Since $W_\mu$ vanishes anyway for $\mu > \nu$, we can state:

Theorem XIV. The average information per symbol carried by a message-segment of length n is

$$ I_n = I^0 - \sum_{\mu=2}^{n} \frac{n - \mu + 1}{n} W_\mu, \tag{38} $$

insofar as n is larger than the actual correlation range.
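
As a check on Theorem XIV, the per-symbol information of a segment can be computed both directly from block probabilities and from the correlation indices, with the sum cut off at the actual range. The sketch assumes the forms $I_n = -(1/n) \sum P \log P$ and $I_n = I^0 - \sum_\mu \frac{n - \mu + 1}{n} W_\mu$; the periodic test message "AAB" (range three) and all function names are hypothetical.

```python
import math
from collections import Counter


def plogp_sum(cycle, length):
    """Sum of P log P over blocks of `length` symbols in a cyclic message."""
    if length <= 0:
        return 0.0
    n = len(cycle)
    counts = Counter(
        tuple(cycle[(start + k) % n] for k in range(length))
        for start in range(n)
    )
    return sum((c / n) * math.log(c / n) for c in counts.values())


def correlation_index(cycle, mu):
    return (plogp_sum(cycle, mu)
            - 2.0 * plogp_sum(cycle, mu - 1)
            + plogp_sum(cycle, mu - 2))


def information_direct(cycle, n):
    """Average information per symbol of an n-symbol segment, computed directly."""
    return -plogp_sum(cycle, n) / n


def information_from_indices(cycle, n, nu):
    """Same quantity from I^0 and the indices W_2, ..., W_nu (requires n >= nu)."""
    i0 = -plogp_sum(cycle, 1)
    correction = sum((n - mu + 1) / n * correlation_index(cycle, mu)
                     for mu in range(2, min(n, nu) + 1))
    return i0 - correction


MESSAGE = "AAB"   # one period of a message whose correlation range is 3
comparison = [(n,
               information_direct(MESSAGE, n),
               information_from_indices(MESSAGE, n, nu=3))
              for n in range(3, 9)]
```

For this message only three distinct segments of any length n >= 2 occur, each with probability 1/3, so both columns equal (log 3)/n. Truncating the sum below the actual range (for example `nu=2`) would break the agreement.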


Since the W's are zero or positive, the intersymbol correlation tends to decrease the amount of
information. Thus, W_\mu can be considered to represent the "strength" of correlation, strength in the
sense of reducing the amount of information. By definition, I_n cannot be negative, hence there is an
upper limit to the total "strength" of the correlation:

\sum_{\mu=2}^{n} \frac{n-\mu+1}{n} W_\mu \leq I^0.    (39)

For n \gg \nu, we obtain from (38)

I_n \approx I_\infty = I^0 - \sum_{\mu=2}^{\nu} W_\mu,    (40)

showing that if we take a sufficiently long segment as a unit, the information per symbol becomes independent
of the length of the segment. This indirectly justifies the usual procedure according to which an
infinitely long message is cut into segments of sufficient length and the segments are treated as if they
did not have any correlation among them.


The quantity called "redundancy" is defined by 


R = (I^0 - I_\infty) / I^0.    (41)
Theorem XV. The redundancy of a language which is characterized by the correlation indices W_\mu is
given by R = (1/I^0) \sum_{\mu=2}^{\nu} W_\mu.    (42)
In the illustration of the last section, if we limit the initial sequences to C, we get
W_2 = \log 2, W_3 = W_4 = ... = 0, I^0 = \log 2, I_\infty = 0, R = 100%.
This last result is not surprising, because the possible infinite sequences are limited to: ..S,S,5S,S,.., 


which certainly cannot convey any information.




1. See for instance W. Feller, Introduction to Probability Theory and its Applications (Wiley, N.Y., 1950), p. 307 ff.
2. Stanford Goldman, Information Theory (Prentice-Hall, N.Y., 1953), p. 45.


MULTIVARIATE INFORMATION TRANSMISSION* 


William J. McGill** 


MASSACHUSETTS INSTITUTE OF TECHNOLOGY 


RESEARCH LABORATORY OF ELECTRONICS 
and 


LINCOLN LABORATORY 


ABSTRACT 


A multivariate analysis based on transmitted
information is presented. It is shown that sample
transmitted information provides a simple
method for measuring and testing association
in multidimensional contingency tables. Relations
with analysis of variance are pointed out,
and statistical tests are described.


*The research in this document was supported jointly 
by the Army, Navy, and Air Force under contract with 
the Massachusetts Institute of Technology. 


**Staff Member, Lincoln Laboratory, Massachusetts 
Institute of Technology. 




MULTIVARIATE INFORMATION TRANSMISSION* 


I. INTRODUCTION 


Several recent articles in the psychological journals have shown how ideas derived 
from communication theory are being applied in psychology. It is not widely understood, how- 
ever, that the tools made available by communication theory are useful for analyzing data, wheth- 
er or not we believe that the human organism is best described as a communications system. 

This memorandum will present an extension of Shannon's¹⁰ measure of transmitted
information. It will be shown that transmitted information leads to a simple multivariate analysis
of contingency data and to appropriate statistical tests.


II. BASIC DEFINITIONS 


Let us consider a communication channel and its input and output. Transmitted in- 
formation measures the amount of association between the input and the output of the channel. If 
input and output are perfectly correlated, all the input information is transmitted. On the other 
hand, if input and output are independent, no information is transmitted. Naturally, most cases 
of information transmission are found between these extremes. There is some uncertainty at the 
receiver about what was sent. Some information is transmitted and some does not get through. 

We are interested not in what the transmitted information is, but in the amount of in- 
formation transmitted. Suppose that we have a discrete input variable, x, and a discrete output 
variable, y. Since x is discrete, it takes on values or signals k = 1,2,3, ... X with probabilities 
indicated by p(k). Similarly, y assumes values m= 1,2,3, ... Y with probabilities p(m). If it 
happens that k is sent and m is received, we can speak of the joint input-output event (k,m). This 
joint event has probability p(k,m). The rules governing the selection of signals at either end of 
the channel must be constructed so that 

\sum_{k=1}^{X} p(k) = \sum_{m=1}^{Y} p(m) = \sum_{k,m} p(k,m) = 1.

Under these conditions, and if successive signals are independent, the amount of information
transmitted in "bits" per signal is defined as

T(x;y) = H(x) + H(y) - H(x,y),    (1)
where 


H(x) = -\sum_k p(k) \log_2 p(k),

H(y) = -\sum_m p(m) \log_2 p(m),

H(x,y) = -\sum_{k,m} p(k,m) \log_2 p(k,m).

One "bit" is equal to -\log_2 (1/2) and represents the information conveyed by a choice between two
equally probable alternatives. Our development will use the bit as a unit, since this is the


* Several of the indices and tests discussed in this paper have been developed independently by J. E. Keith Smith 
at the University of Michigan, and by W.R. Garner at Johns Hopkins University. 




convention in information theory, but any convenient unit may be substituted by changing the base 
of the logarithm. 

If there is a relation between x and y, H(x) + H(y) > H(x,y) and the size of the inequal- 
ity is just T(x;y). On the other hand, if x and y are independent, H(x,y) = H(x) + H(y) and T(x;y) 
is zero. It can be shown that T(x;y) is never negative. 
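
As a rough illustration of Eq. (1), the following Python sketch computes T(x;y) from a table of joint probabilities. The function names and the three example channels are assumptions made for this illustration only; they do not appear in the original development.

```python
import math

def entropy(probs):
    """Entropy in bits of a list of probabilities; zero entries contribute nothing."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def transmitted_information(joint):
    """T(x;y) = H(x) + H(y) - H(x,y), where joint[k][m] holds p(k,m)."""
    p_x = [sum(row) for row in joint]           # input marginal p(k)
    p_y = [sum(col) for col in zip(*joint)]     # output marginal p(m)
    p_xy = [p for row in joint for p in row]    # all joint cells p(k,m)
    return entropy(p_x) + entropy(p_y) - entropy(p_xy)

# Hypothetical two-signal channels, chosen only to show the two extremes
# and one intermediate case.
perfect = [[0.5, 0.0], [0.0, 0.5]]
independent = [[0.25, 0.25], [0.25, 0.25]]
noisy = [[0.4, 0.1], [0.1, 0.4]]

for name, joint in [("perfect", perfect), ("independent", independent), ("noisy", noisy)]:
    print(name, round(transmitted_information(joint), 4))
```

With perfectly correlated input and output the whole input information (one bit here) is transmitted, with independent input and output nothing is transmitted, and the noisy channel falls between the two.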

The presentation to this point has been an outline of the properties of the measure of
transmitted information as set forth by Shannon.¹⁰ These properties may be summarized by stating
that the amount of information transmitted is a bivariate, positive quantity that measures the
association between input and output of a channel. There are, however, very few restrictions on 
how a channel may be defined. The input-output relations that occur in many psychological con- 
texts are certainly possible channels. Consequently, we can measure transmitted information in 


these contexts and anticipate that the results will be interesting. 


III. SAMPLE INFORMATION


Our development will be based on sample measures of information, i.e., on measures 
of information constructed from relative frequencies. 
Suppose that we make n observations of events (k,m). We identify n_{km} as the number
of times that k was sent and m was received. This means that

n_k = \sum_m n_{km},

n_m = \sum_k n_{km},

n = \sum_{k,m} n_{km},

where n_k is the number of times that k was sent, n_m is the number of times that m was received,
and n is the total number of observations. A particular experiment can then be represented by a
contingency table with XY cells and entries n_{km}.
We may estimate the probabilities p(k), p(m) and p(k,m) with n_k/n, n_m/n and n_{km}/n
respectively. Sample transmitted information T'(x;y) is defined as*

T'(x;y) = H'(x) + H'(y) - H'(x,y)    (2)
where H'(x), H'(y) and H'(x,y) are constructed from relative frequencies instead of from proba- 


bilities. As before, T'(x;y) is the amount of transmitted information (in the sample) measured 


in "bits" per signal. 
Since it is difficult to manipulate logs of relative frequencies, we will introduce an 


easier notation: 


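s_{km} = \frac{1}{n} \sum_{k,m} n_{km} \log_2 n_{km},

s_k = \frac{1}{n} \sum_k n_k \log_2 n_k,

s_m = \frac{1}{n} \sum_m n_m \log_2 n_m,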


* Throughout this memorandum, a prime is used over a quantity to indicate the maximum likelihood estimator of the
same quantity without the prime. For example, T'(u;y) is an estimator for T(u;y).




s = \log_2 n.


Expressions involving sample measures of information are easier to handle in this 


notation. For example, T'(x;y) becomes

T'(x;y) = s - s_k - s_m + s_{km}.    (3)


Equations (2) and (3) are equivalent expressions for T'(x;y). When we write equations 
like (3), we shall say that these equations are written in s-notation. Thus Eq. (3) is Eq. (2) in s- 
notation. 
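
The equivalence of Eqs. (2) and (3) can be checked numerically. The Python sketch that follows computes T'(x;y) once from s-terms and once from entropies of relative frequencies. The 2 x 2 table of counts is hypothetical, and the helper names are assumptions made for this illustration.

```python
import math

def s_term(counts):
    """(1/n) * sum of c * log2(c) over a list of counts whose total is n."""
    n = sum(counts)
    return sum(c * math.log2(c) for c in counts if c > 0) / n

def sample_entropy(counts):
    """H' in bits, built from the relative frequencies c/n."""
    n = sum(counts)
    return -sum((c / n) * math.log2(c / n) for c in counts if c > 0)

def split(table):
    """Return row totals n_k, column totals n_m and the flat list of cells n_km."""
    rows = [sum(row) for row in table]
    cols = [sum(col) for col in zip(*table)]
    cells = [c for row in table for c in row]
    return rows, cols, cells

def transmission_s_notation(table):
    """T'(x;y) = s - s_k - s_m + s_km, Eq. (3)."""
    rows, cols, cells = split(table)
    s = math.log2(sum(cells))
    return s - s_term(rows) - s_term(cols) + s_term(cells)

def transmission_entropy_form(table):
    """T'(x;y) = H'(x) + H'(y) - H'(x,y), Eq. (2)."""
    rows, cols, cells = split(table)
    return sample_entropy(rows) + sample_entropy(cols) - sample_entropy(cells)

table = [[8, 2], [2, 8]]   # hypothetical counts n_km, with n = 20
print(transmission_s_notation(table), transmission_entropy_form(table))
```

Because the row totals, the column totals and the cells all sum to the same n, a single `s_term` helper serves for every s-term; only integer counts enter the logarithms, which is the convenience that the s-notation is meant to provide.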


IV. THREE-DIMENSIONAL TRANSMITTED INFORMATION 


Now let us extend the definition of transmitted information to include two sources, u 


and v, that transmit to y. To accomplish this, we replace x in Eq. (2) with u,v and we find that 


T'(u,v;y) = H'(u,v) + H'(y) - H'(u,v,y),    (4)


where x has been subdivided into two classes, u and v. The possible values of u are i = 1, 2, 3,
..., U, while v assumes values j = 1, 2, 3, ..., V. The subdivision is arranged so that the range of
values of u and v jointly constitute the possible values of x. This means that the input event k 


can be replaced by the joint input event (i,j). Consequently, we have 


n_k = n_{ij},


and the direct substitution of u,v for x in Eq. (2) is legitimate. 
Our new term, T'(u,v;y), measures the amount of information transmitted when u 


and v transmit to y. It is evident, however, that the direction of transmission is irrelevant, for 


examination of Eq. (4) reveals that 
T'(u,v;y) = T'(y;u,v).


This means that nothing is gained formally by distinguishing transmitters from receivers. The 
amount of information transmitted is a measure of association between variables. It does not 
respect the direction in which the information is traveling. On the other hand, we cannot per- 


mute symbols at will, for 
T'(u,y;v) = H'(u,y) + H'(v) - H'(u,v,y),
and this is not necessarily equal to T'(u,v;y).
Our aim now is to measure T'(u,v;y), and then to express T'(u,v;y) as a function of


the bivariate transmissions between u and y, and v and y. Computation of T'(u,v;y) is not diffi- 


cult. Our observations of the joint event (i,j,m) organize themselves into a three-dimensional 




contingency table with UVY cells and entries n_{ijm}. We can compute the quantities in Eq. (4) from
this table, or we can write

T'(u,v;y) = s - s_m - s_{ij} + s_{ijm},    (5)

where

s_{ijm} = \frac{1}{n} \sum_{i,j,m} n_{ijm} \log_2 n_{ijm},
and other s-terms are defined by analogy with the s-terms in Eq. (3). 
Now suppose that we want to study transmission between u and y. We may eliminate 
v in two ways. First, let us reduce the three-dimensional contingency table to two dimensions
by summing over v. The entries in the reduced table are

n_{im} = \sum_j n_{ijm}.

We have, for the transmitted information between u and y,

T'(u;y) = s - s_i - s_m + s_{im}.    (6)


The second way to eliminate v is to compute the transmission between u and y separately for each 
value of v, and then average these together. This transmitted information will be called T'_v(u;y),
where

T'_v(u;y) = \sum_j \frac{n_j}{n} [T'_j(u;y)],    (7)

and T'_j(u;y) is information transmitted between u and y for a single value of v, namely, j. It is
readily shown that

T'_v(u;y) = s_j - s_{ij} - s_{jm} + s_{ijm}.    (8)

We see that T'_v(u;y) is written in the same way as T'(u;y), except that the subscript j is added
to each of the s-terms.
to each of the s-terms. 
There are three different pairs of variables in a three-dimensional contingency table. 


For example, the two equations for transmission between v and y are written 


T'(v;y) = s - s_j - s_m + s_{jm},    (9)

T'_u(v;y) = s_i - s_{ij} - s_{im} + s_{ijm}.    (10)

Finally, we may study transmission between u and v, i.e.,

T'(u;v) = s - s_i - s_j + s_{ij},    (11)

T'_y(u;v) = s_m - s_{im} - s_{jm} + s_{ijm}.    (12)


With these results in mind, let us reconsider the information transmitted between u 
and y. If v has an effect on transmission between u and y, then T'_v(u;y) \neq T'(u;y). One way to


measure the size of the effect is by 




A'(uvy) = T'_v(u;y) - T'(u;y),
A'(uvy) = -s + s_i + s_j + s_m - s_{ij} - s_{im} - s_{jm} + s_{ijm}.    (13)

A few more substitutions will show that

A'(uvy) = T'_v(u;y) - T'(u;y)
        = T'_u(v;y) - T'(v;y)
        = T'_y(u;v) - T'(u;v).    (14)


In view of this symmetry, we may call A'(uvy) the u · v · y interaction information. We see that
A'(uvy) is the gain (or loss) in sample information transmitted between any two of the variables, 
due to additional knowledge of the third variable. 
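
The symmetry stated in Eq. (14) is easy to check by machine. In the Python sketch that follows, a three-dimensional contingency table is stored as a dictionary from cells (i, j, m) to counts, and every s-term is obtained by summing the table down to the named subscripts. Both tables are hypothetical, and the function names are assumptions made for this illustration.

```python
import math
from collections import Counter

U, V, Y = 0, 1, 2   # positions of i, j and m inside each cell key

def s(counts, axes):
    """s-term of the table summed down to `axes`; with axes=() it equals log2(n)."""
    n = sum(counts.values())
    marginal = Counter()
    for cell, c in counts.items():
        marginal[tuple(cell[a] for a in axes)] += c
    return sum(c * math.log2(c) for c in marginal.values() if c > 0) / n

def T(counts, a, b):
    """Bivariate sample transmission, as in Eqs. (6), (9) and (11)."""
    return s(counts, ()) - s(counts, (a,)) - s(counts, (b,)) + s(counts, (a, b))

def T_given(counts, a, b, c):
    """Transmission between a and b averaged over c, as in Eqs. (8), (10) and (12)."""
    return s(counts, (c,)) - s(counts, (a, c)) - s(counts, (b, c)) + s(counts, (a, b, c))

def interaction(counts, a, b, c):
    """A' as the change in transmission between a and b when c is held constant."""
    return T_given(counts, a, b, c) - T(counts, a, b)

# Hypothetical table in which y is the exclusive-or of u and v, each cell seen once.
xor_counts = {(0, 0, 0): 1, (0, 1, 1): 1, (1, 0, 1): 1, (1, 1, 0): 1}
# A second, arbitrary table used only to check the symmetry of Eq. (14).
skewed = {(0, 0, 0): 5, (0, 0, 1): 1, (0, 1, 0): 2, (0, 1, 1): 4,
          (1, 0, 0): 1, (1, 0, 1): 6, (1, 1, 0): 3, (1, 1, 1): 2}

print(T(xor_counts, U, Y), T_given(xor_counts, U, Y, V))
print([interaction(skewed, *order) for order in [(U, Y, V), (V, Y, U), (U, V, Y)]])
```

In the exclusive-or table no information is transmitted between u and y alone, yet one full bit is transmitted once v is held constant, so the interaction there is positive. The second line of output shows the same value of A'(uvy) whichever variable is chosen for averaging.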

Now we can express the three-dimensional information transmitted from u,v to y,
i.e., T'(u,v;y), as a function of its bivariate components, for

T'(u,v;y) = T'(u;y) + T'(v;y) + A'(uvy),    (15)
T'(u,v;y) = T'_v(u;y) + T'_u(v;y) - A'(uvy).    (16)


Equations (15) and (16) taken together mean that T'(u,v;y) can be represented by a diagram with 

overlapping circles as shown in Fig.l. The diagram assumes what we shall call "positive" inter- 

action between u,v and y. Interaction is positive when the effect of holding one of the interacting 
variables constant is to increase the amount of association between 
the other two. This means that T'_v(u;y) > T'(u;y) and T'_u(v;y) >
T'(v;y). (Because of Eq. (14), if one of these inequalities holds, both
must hold.) Later on, however, we shall show that interaction may
be negative. When this happens, relations between the interacting
variables are reversed, and the diagram in Fig. 1 is no longer strictly
correct.


Fig. 1. Schematic diagram of the components of three-dimensional transmitted information.
The diagram shows that three-dimensional transmission can be analyzed into a pair of bivariate
transmissions plus an interaction term. The meanings of the symbols are explained in the text.


V. COMPONENTS OF RESPONSE INFORMATION


The multivariate model of information transmission is useful to us because the situations treated
by communication theory are not the same as those with which we deal in psychological applications.
The engineer is usually able to restrict himself to transmission from a single information source.
He knows the statistical properties of the source, and when he speaks of noise he means random
noise. This kind of precision is seldom available to us. In our experiments we generally
do not know in advance how many sources are transmitting information. We must therefore be 
careful not to confuse statistical noise with the experimenter's ignorance. 
The bivariate model of transmitted information provided by communication theory 


tells us to attribute to random noise whatever uncertainty there is in specifying the response when




the stimulus is known. Consequently, if several sources transmit information to responses, the
bivariate model will certainly fail to discriminate effects due to uncontrolled sources from those 
due to random variability. On the other hand, the multivariate model can measure the effects due 
to the various transmitting sources. For example, in three-dimensional transmission we find 


that 


H'(y) = H'_{uv}(y) + T'(u;y) + T'(v;y) + A'(uvy),    (17)
where H'(y) = s - s_m and H'_{uv}(y) = s_{ij} - s_{ijm}.

We see that H'(y), the response information, has been analyzed into an error term
plus a set of correlation terms due to the input variables. The error term H'_{uv}(y) is the residual
or unexplained variability in the output y after the information due to the inputs u and v has been 
removed. In bivariate information transmission, the response information is analyzed less pre- 


cisely. For example, we may have 
H'(y) = H'_u(y) + T'(u;y).    (18)

In this case, the error term is H'_u(y) because only one input, u, is recorded.
Shannon¹⁰ showed that

H'_u(y) \geq H'_{uv}(y).

In other words, the error term, when only u is controlled, cannot be increased if we also control
v. In fact,

H'_u(y) = H'_{uv}(y) + T'_u(v;y).    (19)


Equation (19) is proved by expanding both sides in s-notation. Thus, if u and v are
stimulus variables that transmit information via responses y, we have an error term H'_u(y), provided
we keep track of only one of the inputs, namely, u. However, this error term contains a
still smaller error term, as well as the information transmitted from v. Controlling v is thus 
seen to be equivalent to extracting the association between v and y from the noise. Multivariate 
transmitted information is essentially information analyzed from the noise part of bivariate trans- 


mission. 
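
The partition of response information in Eqs. (17) to (19) can be confirmed on any table of counts. The Python sketch that follows does so for one hypothetical 2 x 2 x 2 table; the cell counts and the function names are assumptions made for this illustration.

```python
import math
from collections import Counter

U, V, Y = 0, 1, 2   # positions of i, j and m inside each cell key

def s(counts, axes):
    """s-term of the table summed down to `axes`; with axes=() it equals log2(n)."""
    n = sum(counts.values())
    marginal = Counter()
    for cell, c in counts.items():
        marginal[tuple(cell[a] for a in axes)] += c
    return sum(c * math.log2(c) for c in marginal.values() if c > 0) / n

def H_y(counts, given=()):
    """Response information that remains when the inputs in `given` are known."""
    return s(counts, given) - s(counts, given + (Y,))

def T(counts, a, b, given=()):
    """Transmission between a and b, averaged over the variables in `given`."""
    return (s(counts, given) - s(counts, given + (a,))
            - s(counts, given + (b,)) + s(counts, given + (a, b)))

# Hypothetical counts n_ijm for two stimuli, two presponses and two responses.
counts = {(0, 0, 0): 5, (0, 0, 1): 1, (0, 1, 0): 2, (0, 1, 1): 4,
          (1, 0, 0): 1, (1, 0, 1): 6, (1, 1, 0): 3, (1, 1, 1): 2}

A = T(counts, U, Y, given=(V,)) - T(counts, U, Y)                  # Eq. (13)
rhs_17 = H_y(counts, (U, V)) + T(counts, U, Y) + T(counts, V, Y) + A
rhs_18 = H_y(counts, (U,)) + T(counts, U, Y)
rhs_19 = H_y(counts, (U, V)) + T(counts, V, Y, given=(U,))

print(H_y(counts), rhs_17, rhs_18)
print(H_y(counts, (U,)), rhs_19)
```

The first line prints H'(y) three times over, once directly and once from each of Eqs. (17) and (18). The second line shows that the error term with only u controlled splits into the smaller error term plus the transmission from v, as Eq. (19) requires.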


VI. AN EXAMPLE 


The kind of analysis that multivariate information transmission yields can be illus- 
trated by a set of data obtained from one subject in an experiment on frequency judgment. 

Four equally loud tones — 890,925,970 and 1005 cycles per second — were presented 
to the subject one at a time in random order. Each tone was 1/2 second long and separated by 
about 3 seconds from the next tone. During preliminary training, the subject learned to identify 
the tones by pairing them with four response keys. In experimental sessions, a loud masking 
noise was turned on, and a random sequence of 250 tones was presented against the noise back- 
ground. A flashing light told the subject when the stimulus occurred, and he was instructed to 


guess, if in doubt, about which one of the four tones it was. 




One object of the experiment was to find weights for both the frequency stimulus and
the immediately preceding response in determining which key the subject would press. Tests 
were run at several signal-to-noise ratios. The data presented here were obtained when the 
signal-to-noise ratio was close to the masked threshold. 

In order to calculate weights, we can consider the experiment as an example of three- 
dimensional transmission. Our analysis is based on the responses to the 125 even-numbered 
stimuli. The odd-numbered responses are considered as the context in which the subject judged 
the even-numbered stimuli. The odd-numbered stimuli are ignored in this analysis. 

The stimuli will be designated as the variable u. Last previous responses are called 
"presponses" and they will be indicated by the variable v. These are the inputs. Current re- 
sponses are represented by y. This is the output variable. Thus we can identify the joint event 
(i,j,m) as the occurrence of response m to stimulus i, following presponse j. Failure to respond 
is considered as a possible response. Consequently, there are four stimulus categories and five 
response categories. 

The subject's responses to the 125 test stimuli were sorted into a 4 × 5 × 5 contingency
table. Two of the reduced tables that were obtained from this master table are reproduced
here in order to illustrate our computations. For example, the stimulus-response plot in
Table I has entries n_{im}. The calculation for s_{im} goes as follows:

s_{im} = \frac{1}{125} [1 \log_2 1 + 5 \log_2 5 + 12 \log_2 12 + ... + 7 \log_2 7 + 10 \log_2 10],

s_{im} = 374.05750/125,

s_{im} = 2.99246.

In the same way, s_{jm} is computed from the figures for n_{jm} in the presponse-response table
(Table II).


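TABLE I. STIMULUS-RESPONSE FREQUENCY TABLE (columns: stimulus; rows: response; cell entries not legible in this copy)
TABLE II. PRESPONSE-RESPONSE FREQUENCY TABLE (columns: presponse; rows: response; cell entries not legible in this copy)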



s_{jm} = \frac{1}{125} [1 \log_2 1 + 1 \log_2 1 + 2 \log_2 2 + ... + 9 \log_2 9 + 3 \log_2 3],

s_{jm} = 372.38710/125,

s_{jm} = 2.97910.

We obtain the value for s_i from the n_i in the bottom marginal of Table I:

s_i = \frac{1}{125} [31 \log_2 31 + 30 \log_2 30 + 33 \log_2 33 + 31 \log_2 31],

s_i = 620.83188/125,

s_i = 4.96665.


The computation for s is based on the total number of measurements: 
s = \log_2 125 = 6.96579.


It is evident that these calculations are performed very easily with a table of n \log_2 n.
If he wishes, the reader may also make the computations with tables of p \log_2 p like those prepared
by Newman⁸ and Dolansky.² The use of p \log_2 p tables for analyzing discrete data is not
recommended, however, because it leads to rounding errors that the table of n \log_2 n avoids.


The complete set of s-terms in the experiment on frequency judgment worked out as follows: 


s_{ijm} = 1.52151        s_i = 4.96665

s_{ij} = 2.98329         s_j = 4.79269

s_{im} = 2.99246         s_m = 4.93380

s_{jm} = 2.97910         s = 6.96579


In Sec. V it was shown that response information H'(y) can be analyzed into components

H'(y) = H'_{uv}(y) + T'(u;y) + T'(v;y) + A'(uvy).    (17)

Since H'(y) = s - s_m, we see that H'(y) = 2.03199 bits. If the subject had used the four response
keys equally often, this figure would have been at most 2 bits. The extra information shows that
the subject sometimes did not respond. This can be verified from the right-hand marginals in
Tables I and II. The rest of the quantities in Eq. (17) are easily computed from s-terms. For
example, H'_{uv}(y) is computed from s_{ij} - s_{ijm}. We see that H'_{uv}(y) is 1.46178 bits. This is the
part of the response information that is not accounted for by either the auditory stimuli or the
presponses. Consequently, 1.46178/2.03199 or 72 per cent of the response information is unanalyzed
error. Some 28 per cent of the response information must therefore be due to associations
between the subject's responses and the two predicting variables.


If we consider the association between auditory stimuli (u) and responses (y), we have

T'(u;y) = s - s_i - s_m + s_{im},

T'(u;y) = 0.05780.




Thus only 0.058 bits are transmitted from the frequency stimuli, accounting for less than 3 per 
cent of the response information. This is not surprising because the signal-to-noise ratio was 
set near the masked threshold, and the stimuli were difficult to hear. 

If we consider the association between presponses (v) and current responses (y), we 


find a little more transmitted information: 


T'(v;y) = s - s_j - s_m + s_{jm},

T'(v;y) = 0.21840.
This value of 0.218 bits transmitted amounts to some 11 per cent of the response information. 
The last element in Eq. (17) is the stimulus × response × presponse interaction A'(uvy).
This is computed from

A'(uvy) = -s + s_i + s_j + s_m - s_{ij} - s_{im} - s_{jm} + s_{ijm},

A'(uvy) = 0.29401.
We see that about 14 per cent of the response information is due to the interaction. Knowledge 
of the interaction also permits us to hold one of the inputs constant while measuring transmission 
from the other input. For example, the transmission from stimuli to responses with presponses 


held constant is: 


T'_v(u;y) = s_j - s_{ij} - s_{jm} + s_{ijm}
          = T'(u;y) + A'(uvy)
          = 0.35181.


Our calculations for the parts of the response information that we can analyze with 
the three-dimensional model lead to weights of approximately 3,11 and 14 per cent for stimuli, 
presponses and interaction respectively. These figures sum to 28 per cent, the amount of trans- 
mitted information we predicted from the size of the noise term. We can also obtain this total 


weight directly by computing the information transmitted from both inputs together. We have 


T'(u,v;y) = s - s_m - s_{ij} + s_{ijm},

T'(u,v;y) = 0.57021.


If we now divide this three-dimensional transmitted information by the response information, we 


get back our figure of 28 per cent. 
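
The arithmetic of this example can be retraced in a few lines. The Python sketch that follows recomputes s and s_i from the figures given in the text (125 observations and the stimulus marginal 31, 30, 33, 31), takes s_m and s_im as printed, and copies the reported values of T'(v;y), A'(uvy) and the error term rather than recomputing them, since the full contingency table is not reproduced here. The variable names are assumptions made for this illustration.

```python
import math

n = 125
s = math.log2(n)                                   # 6.96579 in the text
stimulus_marginal = [31, 30, 33, 31]               # bottom marginal of Table I
s_i = sum(c * math.log2(c) for c in stimulus_marginal) / n
s_m, s_im = 4.93380, 2.99246                       # s-terms as printed in the text

H_y = s - s_m                                      # response information
T_uy = s - s_i - s_m + s_im                        # stimuli to responses, Eq. (6)

# Values reported in the text, copied here rather than recomputed.
T_vy, A_uvy, H_uv_y = 0.21840, 0.29401, 1.46178
T_uv_y = T_uy + T_vy + A_uvy                       # Eq. (15)

# Share of the response information carried by each component, Eq. (17).
shares = {
    "error": H_uv_y / H_y,
    "stimuli": T_uy / H_y,
    "presponses": T_vy / H_y,
    "interaction": A_uvy / H_y,
}
for name, share in shares.items():
    print(f"{name:12s} {100 * share:5.1f} per cent")
print(f"{'transmitted':12s} {100 * T_uv_y / H_y:5.1f} per cent")
```

Small differences in the fifth decimal place against the printed figures are to be expected, because the printed s-terms are themselves rounded.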
There are several points worth noting about our application of information theory to 


this experiment. The first is that the analysis is additive. The component measures of associa- 
tion, plus the measure of error (or noise), sum to the response information. Furthermore, the 
analysis is exact. No approximations are involved. The process is very similar to the partition 
of a sum of squares in analysis of variance. As a matter of fact, a notation can be worked out in 
analysis of variance that is exactly parallel to the s-notation in multivariate information trans- 
mission? 

The second point is that information transmission is made to order for contingency 
tables. Measures of transmitted information are zero when variables are independent in the con- 


tingency-sense (as opposed to the restriction to linear independence in analysis of variance). In 




addition, the analysis is designed for frequency data in discrete categories, while methods based 
on analysis of variance are not. No assumptions about linearity are introduced in multivariate 
information transmission. Furthermore, when statistical tests are developed in a later section, 
it will be shown that these tests are distribution-free in the sense that they are extensions of the 
familiar chi-square test of independence. 

The measure of amount of information transmitted also has certain inherent advan- 
tages. Garner and Hake” and Miller” have pointed out that the amount of information transmitted 
is approximately the logarithm of the number of perfectly discriminated input classes. In experiments
on discrimination like the one that we have discussed, the measure provides an immediate
picture of the subject's discriminative ability. Miller has also discussed applications of this 


property in mental testing and in the general theory of measurement. 


VII. INDEPENDENCE IN THREE-DIMENSIONAL TRANSMISSION 


It is evident from the definition of transmitted information that T'(u,v;y) = 0 when 
the output is independent of the joint input, i.e., when

n_{ijm} = \frac{n_{ij} n_m}{n}.    (20)


With this kind of independence, we can show that 


s_{ijm} = s_{ij} + s_m - s.

This expression for s_{ijm} may be substituted into Eq. (5) to confirm the fact that T'(u,v;y) = 0.
Now suppose that T'(u,v;y) >0, but that v and y are independent, that is to say, 


n_{jm} = \frac{n_j n_m}{n}.    (21)

This leads to

s_{jm} = s_j + s_m - s.

If we substitute for s_{jm} in Eq. (9), we find that T'(v;y) = 0. Equation (21) does not provide a
unique condition for independence between v and y. To show this, let us pick some value of u 


and study the v-to-y transmission at that value of u. We now require that 
n_{ijm} = \frac{n_{ij} n_{im}}{n_i}.    (22)

If we have Eq. (22) for all i, we must have

s_{ijm} = s_{ij} + s_{im} - s_i,

and it follows, from substitution in Eq. (10), that T'_u(v;y) = 0. This is the situation in which v
and y are independent, provided that u is held constant. It is an interesting case because we can
show, from Eq. (14), that if this kind of independence happens,

A'(uvy) = -T'(v;y).




The sign of T'(v;y) must be positive or zero so that -T'(v;y) must be negative or zero. Consequently,
A'(uvy) can be negative. We see that negative interaction information is produced when
the information transmitted between a pair of variables is due to a regression on a third variable. 
Holding the interacting variable constant causes the transmitted information to disappear. 

If we have the independence defined by Eq. (21), we may not necessarily have the in- 
dependence defined by Eq. (22). Let us suppose that we have both, i.e., that we have 


s_{jm} = s_j + s_m - s,

s_{ijm} = s_{ij} + s_{im} - s_i.

Now we substitute for s_{jm} and s_{ijm} in Eq. (8):

T'_v(u;y) = s - s_i - s_m + s_{im},

T'_v(u;y) = T'(u;y).


Both kinds of independence, Eqs. (21) and (22), together mean that v is not involved
in transmission between u and y. When this happens, we do not have three-dimensional transmission,
since u is the only input variable.* As might be expected, both kinds of independence


can be generated from a single restriction on the data, namely, 


n_{ijm} = \frac{n_{im}}{V},

where V is the number of classes in v.
We have studied the case where v is independent of y. We could have had u independ- 


ent of y, or u independent of v. The results are analogous to those we have presented. 
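
A small constructed table makes the case of negative interaction concrete. In the Python sketch that follows, the counts are built to satisfy Eq. (22) exactly, so that v and y are independent within each class of u, while both v and y depend on u. The marginal counts are hypothetical and the function names are assumptions made for this illustration.

```python
import math
from collections import Counter

U, V, Y = 0, 1, 2   # positions of i, j and m inside each cell key

def s(counts, axes):
    """s-term of the table summed down to `axes`; with axes=() it equals log2(n)."""
    n = sum(counts.values())
    marginal = Counter()
    for cell, c in counts.items():
        marginal[tuple(cell[a] for a in axes)] += c
    return sum(c * math.log2(c) for c in marginal.values() if c > 0) / n

def T(counts, a, b, given=()):
    """Transmission between a and b, averaged over the variables in `given`."""
    return (s(counts, given) - s(counts, given + (a,))
            - s(counts, given + (b,)) + s(counts, given + (a, b)))

# Hypothetical design: two classes of u with 16 observations each. Within a
# class of u, the cell counts follow Eq. (22): n_ijm = n_ij * n_im / n_i.
n_i = {0: 16, 1: 16}
n_ij = {(0, 0): 12, (0, 1): 4, (1, 0): 4, (1, 1): 12}
n_im = {(0, 0): 12, (0, 1): 4, (1, 0): 4, (1, 1): 12}
counts = {(i, j, m): n_ij[i, j] * n_im[i, m] // n_i[i]
          for i in (0, 1) for j in (0, 1) for m in (0, 1)}

T_vy = T(counts, V, Y)                       # Eq. (9)
T_vy_given_u = T(counts, V, Y, given=(U,))   # Eq. (10)
A_uvy = T_vy_given_u - T_vy                  # Eq. (14)

print(T_vy, T_vy_given_u, A_uvy)
```

The association between v and y that appears in the reduced table is due entirely to their common dependence on u. Holding u constant makes it disappear, and the interaction information equals the negative of T'(v;y).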


VIII. CORRELATED SOURCES OF INFORMATION


Three-dimensional transmitted information, T'(u,v;y), accounts for only part of the 
total amount of association in a three-dimensional contingency table. It does not exhaust all the 
association in the table because it neglects the association between the inputs. When this associa- 
tion is considered, i.e., when all the relations in the contingency table are represented, we are 
led to an equation that is very useful for generating the components of multivariate transmission. 


Consider 
C'(u,v,y) = H'(u) + H'(v) + H'(y) - H'(u,v,y).    (23)
If we add and subtract H'(u,v), we obtain

C'(u,v,y) = T'(u;v) + T'(u,v;y),
C'(u,v,y) = T'(u;v) + T'(u;y) + T'(v;y) + A'(uvy).    (24)


* Provided that no information is transmitted between u and v. 




We see that C'(u,v,y) generates all possible components of the three correlated information 
sources u,v and y. 


IX. FOUR-DIMENSIONAL TRANSMITTED INFORMATION 


It will be instructive to extend our measures one step further, i.e., to transmitted in- 
formation with three input variables, since from that point results can be generalized easily to 
an N-dimensional input. For simplicity, we shall restrict our development to the case of a chan- 
nel with a multivariate input and univariate output. The more general case with N inputs and M 
outputs does not present any special problems, and can be constructed with no difficulty once the 
rules become clear. 

Let us add a new variable w to the bivariate input u,v. The joint input is now u,v,w. 
We suppose that w sends signals h =1,2,3, ... W. This gives us four sources of information 


u,v,w, and y. We can proceed to define a four-way interaction information A'(uvwy) as follows: 
A'(uvwy) = A'_w(uvy) - A'(uvy).

We have already defined A'(uvy). The definition of A'_w(uvy) will be similar, except that the subscript
w indicates that A'(uvy) is to be averaged over w. As we have already noted, this is accomplished
by adding the subscript h to each of the s-terms that make up A'(uvy). Consequently,

A'_w(uvy) = -s_h + s_{hi} + s_{hj} + s_{hm} - s_{hij} - s_{him} - s_{hjm} + s_{hijm}.    (25)

It is readily shown that A'(uvwy) is symmetrical in the sense that it does not matter which variable
is chosen for averaging, i.e.,

A'(uvwy) = A'_u(vwy) - A'(vwy)
         = A'_v(uwy) - A'(uwy)
         = A'_w(uvy) - A'(uvy)
         = A'_y(uvw) - A'(uvw).    (26)
We see that A'(uvwy) is the amount of information gained (or lost) in transmission by controlling 


a fourth variable when any three of the variables are already known. 


If we examine all possible associations in a four-dimensional contingency table, we 


obtain 
C'(u,v,w,y) = T'(u;v) + T'(u;w) + T'(u;y) + T'(v;w) + T'(v;y) + T'(w;y)
            + A'(uvw) + A'(uvy) + A'(uwy) + A'(vwy) + A'(uvwy),    (27)
where
C'(u,v,w,y) = H'(u) + H'(v) + H'(w) + H'(y) - H'(u,v,w,y).
Equation (27) can be proved by expanding both sides in s-notation. It turns out that,
in the general case, C'(u,v,w, ..., y) is expanded by writing down T-terms for all possible pairs


of variables, and A-terms for all possible combinations of three, four variables and so on. 
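
The general rule can be tried out numerically. The Python sketch that follows writes every T-term and A-term as an alternating sum of sample entropies over the subsets of the variables involved, which is one way of organizing the s-notation expansions and is an assumption of this illustration rather than a formula stated in the text. It then checks Eqs. (26) and (27) on a hypothetical 2 x 2 x 2 x 3 table whose counts are generated by an arbitrary rule.

```python
import math
from collections import Counter
from itertools import combinations, product

def H(counts, axes):
    """Sample entropy H' in bits of the table summed down to `axes`."""
    n = sum(counts.values())
    marginal = Counter()
    for cell, c in counts.items():
        marginal[tuple(cell[a] for a in axes)] += c
    return -sum((c / n) * math.log2(c / n) for c in marginal.values() if c > 0)

def interaction(counts, axes, given=()):
    """T-term (two axes) or A-term (three or more), averaged over `given`."""
    k = len(axes)
    base = H(counts, given)          # zero when nothing is held constant
    total = 0.0
    for r in range(1, k + 1):
        for sub in combinations(axes, r):
            total += (-1) ** (k - r) * (H(counts, sub + given) - base)
    return -total

def total_association(counts, axes):
    """C' = sum of the single-variable entropies minus the joint entropy."""
    return sum(H(counts, (a,)) for a in axes) - H(counts, axes)

# Hypothetical four-dimensional table; cell keys are (i, j, h, m) for u, v, w, y.
counts = {(i, j, h, m): 1 + (3 * i + 5 * j + 7 * h * m + i * j * m) % 4
          for i, j, h, m in product(range(2), range(2), range(2), range(3))}
axes = (0, 1, 2, 3)

expansion = sum(interaction(counts, sub)
                for r in range(2, 5) for sub in combinations(axes, r))
print(total_association(counts, axes), expansion)         # Eq. (27)

four_way = interaction(counts, axes)
for held in axes:                                          # Eq. (26)
    rest = tuple(a for a in axes if a != held)
    print(four_way, interaction(counts, rest, given=(held,)) - interaction(counts, rest))
```

The first line of output shows C'(u,v,w,y) and the sum of its six T-terms and five A-terms side by side. The remaining lines show that the four-way interaction is the same whichever variable is chosen for averaging.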
Four-dimensional transmitted information from u,v,w to y, i.e., T'(u,v,w;y), can


be written as follows: 




T'(u,v,w;y) = H'(y) + H'(u,v,w) - H'(u,v,w,y).    (28)


The same arguments are used to justify Eq. (28) as were used in the case of Eq. (4) in three- 


dimensional transmission. To find the components of T'(u,v,w;y), we note that 
T'(u,v,w;y) = C'(u,v,w,y) - C'(u,v,w).    (29)

This means that T'(u,v,w;y) contains all the components of C'(u,v,w,y) except the correlations
among the inputs. Consequently, the components of T'(u,v,w;y) are

T'(u,v,w;y) = T'(u;y) + T'(v;y) + T'(w;y) + A'(uvy) + A'(uwy)
            + A'(vwy) + A'(uvwy).    (30)


The components of T'(u,v,w;y) are shown in schematic form in Fig. 2. 


If it happens that

n_{hijm} = n_{ijm}/W,
where W is the number of classes in w, all the components of 
C'(u,v,w,y) that are functions of w drop out, and C'(u, v,w,y) = 
C'(u,v,y). In similar fashion, C'(u,v,y) can be reduced to C'(u,y). 
This is precisely what we did in the analysis of independence in 
three-dimensional transmitted information. Since C'(u,y) = T'(u;y), 
we see that all cases of transmission with multivariate inputs can 
be related to the bivariate case. 

Fig. 2. Schematic diagram of the components of four-dimensional transmitted information
with three transmitters and a single receiver.

With three inputs controlled, we are ready to extend the analysis of response information in
Sec. V a step further. We have

H'(y) = H'_{uvw}(y) + T'(u,v,w;y).    (31)
Equation (31) says that we can measure the effects in response information due to the three inputs.
This is evident from the fact that Eq. (30) tells us how to expand T'(u,v,w;y) in its components.


In addition, we know that

H'_{uv}(y) = H'_{uvw}(y) + T'_{uv}(w;y),    (32)
where
T'_{uv}(w;y) = T'(w;y) + A'(uwy) + A'(vwy) + A'(uvwy).    (33)

We see that controlling w in addition to u and v enables us to rescue the information transmitted
between w and y from the noise, and to replace H'_{uv}(y) with a better estimate of noise information,
namely, H'_{uvw}(y).


The transition to N-dimensional input is now evident. In general, we have 


H'(y) = H'_{uvw...z}(y) + T'(u,v,w, ..., z;y).    (34)




The N + 1 dimensional transmitted information T'(u,v,w, ...,Z;y) can then be expanded in its 


components in the manner that we have described. 


X. ASYMPTOTIC DISTRIBUTIONS 


Miller and Madow⁶ have shown that sample information is related to the likelihood
ratio. Following Miller and Madow, we can show that the large sample distribution of the
likelihood ratio may be used to find approximate distributions for the quantities involved in
multivariate transmission.

Consider, for example, three-dimensional sample transmitted information T'(u,v;y).
We can test the hypothesis that T(u,v;y) is equal to zero. This is equivalent to the hypothesis
that

    p(i,j,m) = p(i,j) · p(m) ,    (35)

since T(u,v;y) is zero when input and output are independent. This hypothesis leads to the
likelihood ratio (cf. Ref. 7),


    \lambda = \frac{ n^{-2n} \prod_{i,j} (n_{ij})^{n_{ij}} \prod_{m} (n_m)^{n_m} }
                   { n^{-n} \prod_{i,j,m} (n_{ijm})^{n_{ijm}} } .    (36)

If we take logs, we obtain

    \frac{-2 \log_e \lambda}{1.3863 n} = s - s_{ij} - s_m + s_{ijm} ,

    -2 \log_e \lambda = 1.3863 n T'(u,v;y) .    (37)

For large samples, -2 log_e λ has approximately a χ² distribution with (UV - 1)(Y - 1) degrees
of freedom when the null hypothesis of Eq. (35) is true. Thus 1.3863 n T'(u,v;y) is distributed
approximately like χ² if T(u,v;y) is equal to zero.
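The test just described can be sketched numerically. The following is only an illustration, assuming the sample entropies are computed from a table of counts n_{ijm} and that the constant 1.3863 is 2 log_e 2; the function names and the 2 x 2 x 2 table are hypothetical and do not come from the experiment of Sec. VI.

```python
import numpy as np

def entropy_bits(counts):
    """Sample entropy in bits of a table of counts (empty cells contribute nothing)."""
    counts = np.asarray(counts, dtype=float)
    n = counts.sum()
    nz = counts[counts > 0]
    return float(np.log2(n) - (nz * np.log2(nz)).sum() / n)

def joint_transmission_test(table):
    """Return (T'(u,v;y), -2 log_e lambda, degrees of freedom) for Eq. (35)."""
    table = np.asarray(table, dtype=float)
    U, V, Y = table.shape
    n = table.sum()
    h_uv = entropy_bits(table.sum(axis=2))       # H'(u,v)
    h_y = entropy_bits(table.sum(axis=(0, 1)))   # H'(y)
    h_uvy = entropy_bits(table)                  # H'(u,v,y)
    t = h_uv + h_y - h_uvy
    statistic = 2.0 * np.log(2.0) * n * t        # 1.3863 n T'(u,v;y)
    dof = (U * V - 1) * (Y - 1)
    return t, statistic, dof

# Hypothetical counts n[i, j, m]: the output depends on u only.
table = np.array([[[8, 2], [8, 2]],
                  [[2, 8], [2, 8]]])
t, statistic, dof = joint_transmission_test(table)
print(f"T' = {t:.4f} bits, -2 log_e lambda = {statistic:.2f}, d.f. = {dof}")
```

The statistic would then be referred to a χ² table with the stated degrees of freedom; no table lookup is attempted here.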

A more important problem involves testing suspected information sources. Suppose
in our three-dimensional example, we assume that

    p(i,j,m) = p(i) · p(j) · p(m) .    (38)


This hypothesis leads to the likelihood ratio for complete independence in a three-dimensional
contingency table,

    \lambda = \frac{ n^{-3n} \prod_{i} (n_i)^{n_i} \prod_{j} (n_j)^{n_j} \prod_{m} (n_m)^{n_m} }
                   { n^{-n} \prod_{i,j,m} (n_{ijm})^{n_{ijm}} } .    (39)

After we take logs, we find that

    \frac{-2 \log_e \lambda}{1.3863 n} = 2s - s_i - s_j - s_m + s_{ijm}
                                       = H'(u) + H'(v) + H'(y) - H'(u,v,y) ,

    -2 \log_e \lambda = 1.3863 n C'(u,v,y) .    (40)

For large samples -2 log_e λ has approximately a χ² distribution with (UVY - 1) - (U - 1) - (V - 1) -
(Y - 1) degrees of freedom when the null hypothesis is true.


We also know that

    C'(u,v,y) = T'(u;y) + T'(v;y) + T'_y(u;v) .    (41)

The likelihood ratio can be used to show that 1.3863 n T'(u;y) and 1.3863 n T'(v;y) are
asymptotically distributed like χ² with (U - 1)(Y - 1) degrees of freedom and (V - 1)(Y - 1)
degrees of freedom, respectively, if T(u;y) and T(v;y) are zero. To find the asymptotic
distribution of T'_y(u;v) we make the following hypothesis:

    p(i,j,m) = p(i,m) · p_m(j) ,    (42)

where p_m(j) is the conditional probability of j given m.


Now we have the ratio

    \lambda = \frac{ \prod_{i,m} (n_{im})^{n_{im}} \prod_{j,m} (n_{jm})^{n_{jm}} }
                   { \prod_{m} (n_m)^{n_m} \prod_{i,j,m} (n_{ijm})^{n_{ijm}} } ,    (43)

    \frac{-2 \log_e \lambda}{1.3863 n} = s_m - s_{im} - s_{jm} + s_{ijm} ,

    -2 \log_e \lambda = 1.3863 n T'_y(u;v) .    (44)

In this case, -2 log_e λ has Y(U - 1)(V - 1) degrees of freedom. In view of Eq. (41), we can write

    1.3863 n C'(u,v,y) = 1.3863 n [ T'(u;y) + T'(v;y) + T'_y(u;v) ] .    (45)

The quantities on the right side of Eq. (45) have degrees of freedom that sum to (UVY - U - V - Y + 2).
Since this is the same number of degrees of freedom as on the left hand side of Eq. (45), the
quantities on the right side of Eq. (45) are asymptotically independent, if the null hypothesis

    p(i,j,m) = p(i) · p(j) · p(m)

is true.
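The partition in Eqs. (41) and (45) can be checked on any table of counts. The sketch below is illustrative only: the table is random, the helper names are hypothetical, and the sizes U = 4, V = 5, Y = 5 are an assumption chosen to agree with the 12 and 88 degrees of freedom quoted in the text.

```python
import numpy as np

def H(counts):
    """Sample entropy in bits of a table of counts."""
    counts = np.asarray(counts, dtype=float)
    n = counts.sum()
    nz = counts[counts > 0]
    return float(np.log2(n) - (nz * np.log2(nz)).sum() / n)

def association_components(table):
    """Split C'(u,v,y) into the three terms on the right side of Eq. (41)."""
    table = np.asarray(table, dtype=float)
    h_u = H(table.sum(axis=(1, 2)))
    h_v = H(table.sum(axis=(0, 2)))
    h_y = H(table.sum(axis=(0, 1)))
    h_uy = H(table.sum(axis=1))
    h_vy = H(table.sum(axis=0))
    h_uvy = H(table)
    return {
        "T(u;y)": h_u + h_y - h_uy,
        "T(v;y)": h_v + h_y - h_vy,
        "T_y(u;v)": h_uy + h_vy - h_y - h_uvy,   # s_m - s_im - s_jm + s_ijm
        "C(u,v,y)": h_u + h_v + h_y - h_uvy,
    }

def degrees_of_freedom(U, V, Y):
    """Degrees of freedom attached to each term of Eq. (45)."""
    return {
        "T(u;y)": (U - 1) * (Y - 1),
        "T(v;y)": (V - 1) * (Y - 1),
        "T_y(u;v)": Y * (U - 1) * (V - 1),
        "C(u,v,y)": U * V * Y - U - V - Y + 2,
    }

U, V, Y = 4, 5, 5                                        # assumed sizes
table = np.random.default_rng(0).integers(1, 20, size=(U, V, Y))
parts = association_components(table)
dof = degrees_of_freedom(U, V, Y)
print(parts)
print(dof)
```

Both the information terms and their degrees of freedom add up to the totals, which is what makes the layout resemble an analysis of variance table.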

This means that, as an approximation, we can test T'(u;y), T'(v;y) and T'_y(u;v)
simultaneously for significance under the null hypothesis we have stated. The test is very similar
to an analysis of variance. We can see the similarity by applying the test to the data from our
example in Sec. VI. The significance tests will be made on the quantities in Eq. (45). To do this,
we need to compute C'(uvy) and T'_y(u;v), since these terms were not discussed in Sec. VI. First,
we note that C'(uvy) is the total amount of association in the stimulus × response × presponse
table. We have

    C'(uvy) = 2s - s_i - s_j - s_m + s_{ijm} ,

    C'(uvy) = 0.69055 .

We also need T'_y(u;v), the information transmitted from presponses to stimuli with responses
held constant. This measures how successfully the presponses predict the auditory stimuli.
Since stimuli were chosen at random, we do not expect much transmitted information here. The
computation goes as follows:

    T'_y(u;v) = s_m - s_{im} - s_{jm} + s_{ijm}
              = T'(u;v) + A'(uvy)
              = 0.41435 .

We may now put our computed values for C'(uvy), T'(u;y), T'(v;y) and T'_y(u;v) into Eq. (45) and
perform the χ² tests. The results are summarized in Table III. We have not attempted to
calculate the significance level of C'(uvy) because we do not have enough data to sustain the 88
degrees of freedom. The same criticism can probably be leveled at our test for T'_y(u;v). In any
case, Table III shows that the only significant effect in the experiment is the presponse-response
association.
                                 TABLE III
                     TABLE OF TRANSMITTED INFORMATION

Transmission           Component      -2 log_e λ    Degrees of    Probability
                                                     Freedom

Stimulus-Response      T'(u;y)          10.016          12           > .50
Presponse-Response     T'(v;y)          37.844
Presponse-Stimulus     T'_y(u;v)        71.802
Total                  C'(u,v,y)       119.664


One interesting fact that the analysis brings out clearly is that we cannot decide
whether an amount of transmitted information is big or small without knowing its degrees of
freedom. In our example we find that T'_y(u;v) = 0.414 bits, while T'(v;y) = 0.218 bits. Yet T'(v;y)
is significant and T'_y(u;v) is not. The reason lies in the difference in degrees of freedom. Miller
and Madow have discussed the amount of statistical bias in information measures due to degrees
of freedom, and have suggested corrections.
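As a consistency check on the figures quoted above, each tabled value of -2 log_e λ should equal 1.3863 n times the corresponding information in bits. The sample size n is not restated in this section, so the sketch below merely infers it from each pair of quoted numbers; the variable names are hypothetical.

```python
import math

# (information in bits, tabled -2 log_e lambda), as quoted in the text and Table III
quoted = {
    "T'(v;y)": (0.218, 37.844),
    "T'_y(u;v)": (0.41435, 71.802),
    "C'(u,v,y)": (0.69055, 119.664),
}

implied_n = {}
for name, (bits, statistic) in quoted.items():
    # -2 log_e lambda = (2 log_e 2) * n * bits, solved for n
    implied_n[name] = statistic / (2.0 * math.log(2.0) * bits)
    print(f"{name}: implied n = {implied_n[name]:.2f}")
```

The three pairs imply very nearly the same n, which suggests that the quoted bit values and the tabled statistics are mutually consistent.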

In Table III, we tested T'_y(u;v), the association between presponses and stimuli with
responses held constant. This association is broken down still further in Table IV. No
probability is estimated in Table IV for the interaction term A'(uvy) because its asymptotic
distribution is not chi-square. All A-terms are distributed like the difference of two variables,
each of which has the chi-square distribution. The distribution of this difference is evidently
not chi-square because the difference can be negative. Its density function has been derived by
Pearson, Stouffer, and David,⁹ but the writer has been unable to find a table of the integral.
In some cases the problem can be circumvented by combining A-terms with T-terms to make new
T-terms. [See, for example, Eq. (33).] However, in other cases, the interactions are genuinely
interesting in their own right, and should be tested directly. These cases can be treated when
adequate tables become available.


                                 TABLE IV
                     TABLE OF TRANSMITTED INFORMATION

Transmission           Component      -2 log_e λ    Degrees of    Probability
                                                     Freedom

Presponse-Stimulus     T'(u;v)          20.853          12
Interaction            A'(uvy)          50.948                        **
Total                  T'_y(u;v)        71.802

** Probability not estimated.


REFERENCES 


1. R. M. Fano, "The Transmission of Information - II," Technical Report No. 149,
   Research Laboratory of Electronics, M.I.T. (6 February 1950).

2. W. R. Garner and H. W. Hake, Psychol. Rev. 58, 446-459 (1951).

3. L. Dolansky, "Table of p log p," Technical Report No. 227, Research Laboratory
   of Electronics, M.I.T. (2 January 1952).

4. W. J. McGill, "Multivariate Transmission of Information and its Relation to Analysis
   of Variance," Report No. 32, Human Factors Operations Research Laboratories,
   M.I.T. (May 1953).

5. G. A. Miller, Amer. Psychologist 8, 3-11 (1953).

6. G. A. Miller and W. J. Madow, "Information Measurement for the Multinomial
   Distribution" (in preparation).

7. A. M. Mood, Introduction to the Theory of Statistics (McGraw-Hill Book Co., Inc.,
   New York, 1950).

8. E. B. Newman, Amer. Jour. Psychol. 64, 252-262 (1951).

9. K. Pearson, S. A. Stouffer and F. N. David, Biometrika 24, 293-350 (1932).

10. C. E. Shannon and W. Weaver, The Mathematical Theory of Communication (University
    of Illinois Press, Urbana, 1949).

11. J. E. Keith Smith, "Multivariate Attribute Analysis" (in preparation).

12. F. L. Stumpers, "A Bibliography of Information Theory," Technical Report, Research
    Laboratory of Electronics, M.I.T. (2 February 1953).




Fig. 1 - Schematic diagram of the components of three-dimensional transmitted
         information. The diagram shows that three-dimensional transmission
         can be analyzed into a pair of bivariate transmissions plus an
         interaction term. The meanings of the symbols are explained in the text.


Fig. 2 - Schematic diagram of the components of four-dimensional transmitted
         information with three transmitters and a single receiver.


CHOICE AND CODING IN INFORMATION RETRIEVAL SYSTEMS 


Calvin N. Mooers 
Zator Company 
Boston, Mass. 


Introduction 


Information retrieval machines are devices for indexing and selecting information in a 
library. The operation of these machines is based upon some sort of an arbitrary code system, in 
terms of which the machines' operations are defined. Because they use coding systems to deal with 
information, such machines have much in common with devices for point-to-point signalling, and in 
particular, with multiplex transmission systems. One is thus led to inquire whether--or in what 
manner--the formalism of communication theory as developed for signalling can be applied to machines 
for information retrieval. This paper presents several results from such an inquiry. 


In general, it can be said that the methodology and approach of communication theory have
been helpful in building a theory of information retrieval systems. Three topics are discussed in
this paper. 1) The retrieval system analogue to H = -Σ p_i log p_i, the measure of the output of
a source, is developed. 2) The retrieval system analogues to synchronous and asynchronous multiplex
types of coding are described and channel capacities are discussed. For the latter type of coding,
called "superimposed," a new limit on channel capacity of log_e 2 bits per site is given. 3) Selection
errors due to coding are discussed. It is shown that the frequency of errors can be made arbitrarily
small for the superimposed type of coding, in analogy to the result for signalling.


The Retrieval System Model 


While there are a variety of retrieval systems, such as those based upon decimal
classifications or upon alphabetical indexing, this paper is restricted to those systems in which a
machine serially scans the marks on each tally from a battery of tallies to make the selection of
the desired documents.¹ For instance, the machine may be one which scans the punches in a pack
of Hollerith cards, making certain selections as it does so. Each document in the library is
represented by a tally, and marks and blanks on the tally carry in digital form indications of the
subject matter in the document.


There are a variety of opinions and proposals on how to deal with the semantic aspects of
the subject matter of a document to prepare it for the digital coding. Unlike the situation in
communication theory, the semantic problem cannot be dodged in information retrieval. The method to
be described here is thus not the only one. However, it has the advantage of having met the test of
wide usage, and of being simple. In essence, there is a restricted set or repertory of semantic
units called "descriptors."² Each descriptor stands for a pre-assigned scope of meaning. For
each document, a subset is formed from those descriptors whose scope of meaning touches upon the
semantic content of the document. Other than their being grouped in a subset, no other
interrelationship is made between the descriptors of a subset. For document D_i, we shall denote its
descriptor subset by D_i(S). A selection of useful documents is prescribed in terms of one, two,
or more descriptors conjointly. If the descriptors S_1, S_2, and S_3 are in the selection prescribing
subset R(S), we require that the selector machine shall segregate tallies of documents D_i for which
all of the descriptors in the subset R(S) are contained within the subset D_i(S).


It will help to have before us some numbers illustrative of the problem. The number of
documents in a typical collection, and thus the number B of tallies in a battery, may be in the
order of 10,000. On a tally the number of sites F available for code marks and blanks is in the
order of 200, though in different operating systems it may range from 40 to 500. In practice it has
often been found that an average of approximately 8 or 10 descriptors are used in the
characterization of any document such as a technical report, journal article, or the like. In practice it is
often possible to set an upper limit k_D on the number of descriptors in the subsets D_i(S), where
k_D is almost never exceeded. At the other end, it is also often possible to set a lower limit k_R
on the number of descriptors in the selection prescribing subsets R(S), such that the number of
these descriptors almost never is less than k_R. The existence of such "limits" in practice is
very important.


The Measure Of Required Choice Per Tally, J 


In communication theory the quantity H = -Σ p_i log p_i is a measure of variety in the
output of a source. Described in another way, H is in the nature of a weighted average of the
minimal number of binary digits (bits) required to express the choice between the possible messages
of a source, the messages being independent, and p_i being the a priori probability of the ith
message. The measure of choice H is important because it directly determines the least amount of
digital "content" that a signalling channel may transmit in order that a receiving terminal may
reproduce the output of the source.


As a matter of terminology, and to prevent confusion, we shall refrain here from identi- 
fying H with “amount of information." H is a measure of digital representations and their frequen- 
cies--abstracted from semantic information. Information retrieval systems, on the other hand, must 
deal with very real problems in semantic information on at least two levels: the document text, 
and the descriptors. 


We shall now inquire into the retrieval system analogue of the H of communication theory.
A heuristic rather than a rigorous argument is followed here as well as later, such an argument being
appropriate to the present state of the retrieval art. A retrieval selection is prescribed in terms
of a descriptor subset R(S) taken from a set of descriptors numbering V. The result of the
retrieval operation is the selection of a subset of tallies from the B tallies in the battery. The
selected subset may have no tally, one tally, or many tallies--with all cases being useful.
Fortunately we do not have to deal with the very general problem of mapping all subsets into subsets.
The nature of the information in the documents narrows the problem. We specifically note that the
operation of choice during machine selection is upon the battery of tallies numbering B, and not
upon the descriptor set numbering V. There is (in the first approximation) no reason to expect some
tallies to be chosen more frequently than the others. Thus, the specification for choice of any one
tally from the set of B tallies requires a minimum of log_2 B bits to express the choice.
Furthermore, the specification is made in terms of at least k_R descriptors acting conjointly. Presuming
also (as a first approximation) that the various descriptors are used with equal frequency, each
descriptor need carry no more than (1/k_R) log_2 B bits of choice.


Each tally, in its code of marks and blanks, must indicate a capability for selection
according to the descriptors D_i(S) which number k_i. The number F of sites available for marking
on a tally is fixed (according to the model under consideration). We cannot do as in communication
theory and consider merely the average number of bits required. We must consider instead the case
of the maximum number of bits, and be sure that this case is under control. The maximum is the
case of a tally having k_D descriptors. Therefore, in the model given, a tally must have a capacity
for indicating a choice of

    J = (k_D/k_R) log_2 B bits per tally.    (1)

The quantity J plays the part in information retrieval which is analogous to the part played by
H in communication theory. H sets the minimal channel capacity compatible with the accurate
reproduction of the output of the source. J sets the minimal tally capacity compatible with the
accurate retrieval selection upon the set numbering B.


We specifically note that for a less restricted model, the quantity J can be somewhat
smaller, and in fact various weighted means become possible. One such is J = Σ a_i (k_i/k_R) log_2 B in
the case the effective F is permitted to vary in certain ways. This is no surprise, since H in
communication theory can also become smaller when account is taken of any non-independence of the
output probabilities of the source. Aside from mentioning that J can be refined in various directions
we shall not pursue the matter further. On a practical level, the value in (1) is adequate for
specifying the problem.


As a numerical example, we take the illustrative case given previously. For B = 10,000
and with the typical values of k_R = 2 and k_D = 15, the required choice per tally is J = 100 bits.
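The arithmetic of Eq. (1) for this example can be reproduced in a few lines. This is a minimal sketch; the function name is hypothetical, and the subscripts follow the reading k_D (upper limit per tally) and k_R (lower limit per selection).

```python
import math

def required_choice_bits(B, k_d, k_r):
    """J = (k_D / k_R) log2(B): choice a tally must be able to indicate, Eq. (1)."""
    return (k_d / k_r) * math.log2(B)

J = required_choice_bits(B=10_000, k_d=15, k_r=2)
print(f"log2(B) = {math.log2(10_000):.2f} bits, J = {J:.1f} bits per tally")
```

The result rounds to the 100 bits quoted in the text.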


The Two Kinds Of Coding Systems 


There are two basically different kinds of coding now being used in information retrieval
systems. For convenience, they are distinguished here by the terms "binary coding" and "superimposed
coding." The latter, in the form of superimposed random coding, is also known under the trade name
"Zatocoding."³ Coding in retrieval systems is closely analogous to coding in multiplexed signalling
systems. On each tally, which may be compared to a time segment of a signalling channel, each
descriptor must operate effectively and independently of the others. Retrieval system "binary coding"
is very much like a time division pulse code multiplex signalling system. Superimposed coding is
very much like an asynchronous time division multiplex signalling system--although its exact
signalling parallel has not yet appeared in practice. Related systems have been mentioned.⁷ The
salient features of the binary and superimposed coding methods are as follows:


Binary: The F sites of each tally are partitioned into groups of N sites each, with each of the
F/N groups able to take one descriptor code pattern.
Superimposed: The F sites are not partitioned; the descriptor code patterns rely upon statistical
randomness to insure separation.

Binary: A descriptor pattern is a binary numeral consisting of marks and blanks with a total of
N digits.
Superimposed: A descriptor pattern consists of marks only, with the N marks of a pattern being
distributed over the F sites.

Binary: Descriptor patterns are added to a tally by putting the new pattern into any unoccupied
group of sites.
Superimposed: Descriptor patterns are added to a tally by Boolean addition of the new pattern to
the marks already on the tally, i.e., by pattern superimposition. The only tally sites left
unmarked are those not having a mark from any descriptor code.

Binary: Tally selection according to a descriptor pattern requires that in some tally group of
sites there is a pattern whose marks and blanks agree completely with the selecting descriptor
pattern.
Superimposed: Tally selection according to a descriptor pattern requires that there is a mark in
every tally site corresponding to a mark in the selective pattern, and other marks and blanks in
the tally make no difference. The pattern of selecting marks is included within the pattern of
tally marks.

Binary: Tally selection according to several conjoint descriptors requires that the pattern for
each and every selecting descriptor must be found somewhere in the selected tally.
Superimposed: Tally selection according to several conjoint descriptors requires that the Boolean
sum of all the selecting descriptor patterns must be included within the total tally pattern of
marks.

Binary: Changing a mark to a blank on a tally, or vice versa, will completely change a code,
giving it the indication of another descriptor.
Superimposed: Changing a blank to a mark on a tally does not preclude selection, and has little
effect. Changing a mark to a blank may prevent several descriptors from selecting a tally.

Binary: Descriptor codes are non-interfering, being in separate partitions. There are no other
sources of digital noise.
Superimposed: Superimposing the codes does cause a kind of interference by changing blanks to
marks. Desired tallies are never excluded by the interference. The noise does result in the
spurious appearance of a few extra tallies.

Binary: The tally is completely filled when there is a descriptor pattern in each partition.
Superimposed: The tally is optimally used when the density of marks approaches 50 per cent.

Binary: Selection according to k descriptors requires the making of the equivalent of k times F/N
separate pattern-matching attempts for each tally scanned.
Superimposed: Selection according to any number of descriptors requires only one pattern-matching
attempt per tally.
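The superimposed column of the comparison above can be illustrated with a small sketch. It assumes that a tally is represented as an integer bit mask, that descriptor patterns are drawn at random, and that F = 40 and N = 4; these values, the descriptor names, and the function names are all hypothetical and serve only to show the Boolean sum and the inclusion test.

```python
import random

F = 40   # sites per tally (hypothetical value)
N = 4    # marks per descriptor pattern (hypothetical value)

def random_pattern(rng):
    """A descriptor pattern: N marks scattered at random over the F sites."""
    mask = 0
    for site in rng.sample(range(F), N):
        mask |= 1 << site
    return mask

def superimpose(patterns):
    """Boolean sum (bitwise OR) of patterns, as punched into one tally."""
    total = 0
    for pattern in patterns:
        total |= pattern
    return total

def selects(tally, selecting_patterns):
    """One matching attempt: every selecting mark must be present on the tally."""
    query = superimpose(selecting_patterns)
    return (tally & query) == query

rng = random.Random(1)
codebook = {name: random_pattern(rng) for name in ("radar", "noise", "coding", "optics")}

tally = superimpose([codebook["radar"], codebook["noise"], codebook["coding"]])
wanted = selects(tally, [codebook["radar"], codebook["noise"]])

# Changing marks to blanks can prevent a wanted selection.
damaged = tally & ~codebook["radar"]
still_selected = selects(damaged, [codebook["radar"], codebook["noise"]])
print(wanted, still_selected)
```

A tally that carries the selecting descriptors is always selected. A tally that does not carry them may occasionally be selected by chance, which is the interference noted in the comparison.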


Coding according to somewhat inefficient versions of the "binary" method has been used
widely for a very long time in information retrieval machines. The "superimposed" method with
random patterns is more recent. A detailed discussion of it will be found elsewhere.²,³,⁹,¹⁰
A striking difference between the two methods of coding is the difference in complexity of the
selector required. The binary coding selector machines must try many different pattern matchings,
and as a result are either slow or expensive. Machines for superimposed coding, having a simpler
task, are fast and relatively simple. In payment for this considerable advantage, superimposed
random coding requires that the number of tally sites F be 45 per cent greater for the same J.


Capacity In Bits Per Tally -- Binary Coding 


A battery of tallies, quite as well as a signalling channel, has a measurable capacity for
carrying the coded indications of choice. In the case of binary coding, the battery itself is
inherently noiseless--presuming no malfunction of the selecting machine or other accidents. It is thus
plausible to speculate that each tally has a capacity of one bit per site, or F bits per tally.
Arguments substantiating this speculation are given here. A peculiarity typical of information
retrieval systems arises. While the "channel" itself is essentially noiseless, efficient codings
which are capable of exploiting approximately the full capability of each tally must necessarily
give rise to a small but finite selection error. That is, for binary coding systems wherein the
number of tally sites F becomes approximately equal to the quantity J, a type of error appears that
may be called "code synonym error."


The error stems from the fact that J = (k_D/k_R) log_2 B takes no account of the number V of the
descriptors in the repertory set. This omission is quite correct, however, since tally choice only
involves a choice from among the B tallies. It does not require a dependence upon V. An efficient
coding allows only N = (1/k_R) log_2 B digits per descriptor. However, N digits can form only 2^N
different binary patterns, and V may be larger than 2^N. (When V is not larger, there is no synonym
error.) For example, in the illustrative case already described, with B = 10,000, k_R = 2, and
V = 500, the quantity N equals 7 and 2^N is only 128. As a result, with such an efficient coding
system, there is a four-fold doubling up. Four descriptors must use the same code pattern. However,
no actual retrieval selections are specified by only a single descriptor at a time because k_R = 2.
Consequently, the probability of error due to code synonyms is in the order of (4/500)^2 per tally,
which is somewhat less than 1/10,000. We note that the code synonym error does not erroneously
exclude any tallies that should be selected. Instead it errs (in the worst case of two descriptors)
by permitting the appearance of unwanted tallies, with a probability of less than one unwanted tally
for each selection upon the entire battery of B tallies. In practical retrieval systems, such an
error is no handicap whatsoever. The fact that B = 10,000 and that the error per tally is
approximately 1/10,000 is not coincidental. It is a consequence of the logic behind the particular
definition of J, requiring a maximal power of choice of one tally from a set of B tallies for the
minimum of k_R prescribing descriptors.
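The figures in this example follow from a short calculation. The sketch below assumes that N is obtained by rounding (1/k_R) log_2 B up to a whole digit and that descriptors are spread evenly over the available patterns; the variable names are hypothetical.

```python
import math

B, k_r, V = 10_000, 2, 500

N = math.ceil(math.log2(B) / k_r)         # digits per descriptor (rounded up: assumption)
patterns = 2 ** N                         # distinct binary patterns of N digits
sharing = math.ceil(V / patterns)         # descriptors forced onto one pattern
error_per_tally = (sharing / V) ** k_r    # order of the code synonym error
expected_unwanted = B * error_per_tally   # unwanted tallies per scan of the battery

print(N, patterns, sharing, error_per_tally, expected_unwanted)
```

With these numbers the expected count of unwanted tallies per selection stays below one, in agreement with the statement in the text.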


The ordinarily-used retrieval systems using binary coding are not designed for such coding
efficiency. This is due to the uncritical belief that descriptors must be coded with unique patterns,
i.e., that N must be sufficiently large so that 2^N is greater than V.¹ In our illustration, an
efficient binary coding (with the four-fold synonym redundancy) with F = 200 available sites, allows
k_D = 200/7 = 28. This permits a choice-indicating ability of J = 186 bits per tally. (Because there
cannot be fractional digits in the codes, the round-off errors will usually make J somewhat smaller
than F, even for efficient coding.) In comparison, by the more wasteful, but more usual, coding of
the same tally, for V = 500 we have N = 9. As a result k_D = 22 and J = 144 bits of choice per tally,
an appreciable loss in capacity. Such a loss in choice per tally due to poor coding is even greater
for larger values of V.


Following Shannon, in defining the channel capacity as the maximum transmission rate that 
can be achieved by a set of codings, we define the maximum tally capacity as the maximum number of 
bits of choice that can be carried. It is plausible to conclude that when the maximum is taken 
over the binary codings, the tally has a maximum capacity of one bit per site, or F bits per tally. 


Capacity In Bits Per Tally -- Superimposed Coding 


With superimposed coding some unexpected things happen when one tries to compute the 
analogue of channel capacity. The coding is inherently noisy, thus there is equivocation. Only the 
marks have meaning. The marks of one pattern interfere with other patterns, changing blanks to marks. 


The design of superimposed coding systems for specific problems has been treated
elsewhere.³,⁸ The conclusions only are presented here. Each descriptor is given a pattern of N =
(1/k_R) log_2 B marks. (This presumes that the design calls for an occurrence of no more than one
extra selection on the average for every scanning of the battery.) The descriptor patterns are
generated by some random or quasi-random process, thus allowing a statistical separation of the
patterns to operate. (This is equivalent to multiplexing on the basis of statistical improbability
of the channels interfering inordinately.⁷) Optimum usage of the tallies requires that approximately
50 per cent of the tally sites be marked. For densities appreciably beyond this, interference, shown
by extra selections, sharply increases. The 50 per cent criterion is met (conservatively) by setting
the maximum k_D at the value k_D = (log_e 2)F/N = 0.69F/N.
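These design rules can be tried on the illustrative case. The sketch assumes F = 200 sites, rounds N up and the descriptor limit down to whole numbers, and uses the random-coding mark density 1 - (1 - N/F)^k; the function name and the rounding choices are assumptions, not part of the stated design procedure.

```python
import math

def superimposed_design(B, k_r, F):
    """Marks per pattern, largest descriptor count, and resulting mark density."""
    N = math.ceil(math.log2(B) / k_r)         # marks per descriptor pattern
    k_max = math.floor(math.log(2) * F / N)   # 0.69 F / N, rounded down
    density = 1 - (1 - N / F) ** k_max        # expected fraction of sites marked
    return N, k_max, density

N, k_max, density = superimposed_design(B=10_000, k_r=2, F=200)
print(f"N = {N}, k_max = {k_max}, density = {density:.3f}")
```

The density comes out just under one half, which agrees with the remark that the criterion is met conservatively.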


Let us now put these facts aside, starting instead from the beginning with Shannon's
formalism of communication theory. We shall be concerned with finding the maximum capacity of a
tally under superimposed codings. Let each tally with F sites be coded with k descriptors of N
marks per code pattern. Because we use random codes, the probability that any site will have a
mark from the k patterns is P_k = 1 - (1 - N/F)^k. For any one tally, consider the joint probability
p(i,j) that when an indication "i" is marked into a site by a descriptor the indication "j" will
be found on the fully marked tally. The a priori probability of any one site being marked by a
descriptor is N/F. If a tally site is thus intentionally marked (indication "1"), the tally will
remain marked at the site and p(1,1) = N/F. No mark will change to a blank, so p(1,0) = 0. For
sites not intentionally marked by the descriptor--but possibly marked by any of the k-1 others--
we have p(0,1) = (1 - N/F) P_{k-1} and p(0,0) = (1 - N/F)(1 - P_{k-1}). With these assumptions, the
transmission rate or tally capacity R in bits per site can be computed according to Shannon's
formulas. The details are found in the appendix.


For all k descriptors of the tally, the total capacity must be k times the capacity for
one descriptor, or R_t = kR bits per site. Maximizing R_t as a function of k (see appendix) leads
directly to these conditions on k, N, and F:

    kN/F = log_e 2 = 0.69    (2)

which in turn is equivalent to:

    P_k = 1/2    (3)

These are the conditions for optimal use of the tally, giving it maximal capacity. Under these
conditions, and for N/F small (as it usually is) the average tally capacity per descriptor is N/F
bits per site (see appendix). Thus the tally carries just N bits for each descriptor. The total
tally capacity under these optimal conditions is

    kRF = (F/N)(log_e 2)(N/F)F = F log_e 2 total bits per tally    (4)

    or log_e 2 = 0.69 bits per site.    (5)




From this it is apparent that the total capacity of a tally with superimposed coding
cannot exceed log_e 2 bits per site. Similarly, in any asynchronous multiplex communication system
based upon any analogous use of superimposed signalling patterns, the transmission rate cannot
exceed 0.69 bits per pulse width. This conclusion seems to be a new result for this kind of coding.
However, in a narrower sense, the same conclusion was reached by the author by another method
earlier.¹⁰ Also, a figure in a paper by White contains a hint of the conclusion although he
states he was unable to determine the maximum efficiency.
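The capacity argument can be examined numerically from the joint probabilities p(i,j) given earlier. The sketch below is an illustration under an assumed value N/F = 0.01: it computes R = H(x) + H(y) - H(x,y) per site for one descriptor and scans k for the largest total kR. The function names and the scan range are hypothetical.

```python
import math

def entropy(probs):
    """Entropy in bits of a list of probabilities (zeros are skipped)."""
    return -sum(x * math.log2(x) for x in probs if x > 0)

def rate_per_site(p, k):
    """R = H(x) + H(y) - H(x,y) for one descriptor among k, with p = N/F."""
    q = 1 - p
    Q = q ** (k - 1)          # site left blank by the other k - 1 descriptors
    P = 1 - Q                 # P_{k-1}
    joint = [p, 0.0, q * P, q * Q]            # p(1,1), p(1,0), p(0,1), p(0,0)
    h_x = entropy([p, q])
    h_y = entropy([p + q * P, q * Q])
    return h_x + h_y - entropy(joint)

p = 0.01                                      # assumed small value of N/F
totals = {k: k * rate_per_site(p, k) for k in range(1, 301)}
best_k = max(totals, key=totals.get)
per_descriptor = rate_per_site(p, best_k) / p  # bits per descriptor, in units of N

print(best_k, best_k * p, totals[best_k], per_descriptor)
```

For this small N/F the scan places the maximum near kN/F = 0.69, with a total close to 0.69 bits per site and close to N bits per descriptor. The agreement is approximate, since the stated conditions hold in the limit of small N/F.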


Selection Errors, Noise And Coding 


As has been shown, binary coding for retrieval selection has no "noise" and the only errors
are due to coding synonyms in the case of an efficient coding. Superimposed coding is inherently
noisy. With regard to it, we can look for an analogue to Shannon's result for a noisy channel. He
has shown that it is possible so to encode a message that the probability of error becomes
arbitrarily small, if sufficient time delay is allowed. In particular, he has stated that the probability
of error in a given message sequence is bounded by

    P < 2^{-T(C-H)}    (6)

where T is the length of the sequence, P is the probability of error, C is the channel capacity,
and H is the rate of generation of the source. (It is to be noted that, in general, codings which
achieve this result have not been achieved in practice.)


For superimposed coding, when no more than optimal coding density of marks is used, the
probability P of error per tally in selection is bounded by

    P < 2^{-G}    (7)

where G is the total number of marks in the superimposed selection pattern from the
selection-prescribing descriptors of the subset R(S). The result is simply explained. At optimal use, the
density of marks averages 1/2. An erroneous selection can occur only if there is a chance occurrence
of a mark in each of the G sites of the selective pattern. For each site the chance is 1/2, and for
G sites simultaneously, it is less than 2^{-G}. It is to be noted that superimposed codings do in
practice achieve results of this order, though deviations from it may occur due to correlations
between certain descriptors as they are used to describe the documents.
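For the illustrative case the bound of Eq. (7) can be evaluated directly. The sketch assumes that the k_R = 2 selecting descriptors have N = 7 marks each and that their patterns do not overlap, so that G = 14; the variable names are hypothetical.

```python
B, k_r, N = 10_000, 2, 7

G = k_r * N                    # marks in the selecting pattern (no overlap assumed)
bound = 2.0 ** -G              # Eq. (7): chance that a non-matching tally passes
expected_extra = B * bound     # extra selections per scan of the whole battery

print(G, bound, expected_extra)
```

The expected number of extra selections per scan is below one, which is in line with the design presumption of no more than one extra selection on the average for every scanning of the battery.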


This last analogy is no mere formal accident. Looking to the communication side of the
analogy, we see that the message contained in the sequence of length T can be expressed in TH bits.
This leaves T(C-H) bits of redundancy or choice for use in case of equivocation to retrieve the
correct message from other possible but non-wanted messages. Each of the T(C-H) bits gives an
improvement in error by a factor of 1/2.


On the information retrieval side of the analogy, the actual message is in the document,
so there is no purpose to coding it on the tally. All of the G marks of the selective pattern are
available for choice to overcome the equivocation between documents of the collection and to
retrieve the desired documents. Each of the G marks decreases the error by a factor of 1/2.


Concluding Remarks 


We have seen that information retrieval systems are susceptible to treatment by communi- 
cation theory at the coding and machine level, and that there are a number of analogies between 
retrieval systems and multiplex signalling systems. Historically, retrieval theory has been aided
by communication theory. In the other direction, there is reason to believe that developments--both 
theoretical and practical--originally made with retrieval systems may be applicable to the develop- 
ment of signalling systems. For instance, some retrieval practice seems to be ahead of work in 
asynchronous multiplex signalling. For another thing, techniques for handling semantic information 
in retrieval--not discussed here--may be suggestive for further development in communication theory 
if and when such matters are undertaken. 




Appendix 
We shall verify for the case of superimposed random coding that the maximum tally capacity
occurs when $kN/F = \log_e 2$ with N/F small, and that under these circumstances the number of bits per
descriptor is N. Where $p = N/F$, $q = 1 - p$, $P = q^{k-1} = (1 - p)^{k-1}$, and $Q = 1 - P$, and following
Shannon's formalism closely:


$p(1,1) = p$, $p(1,0) = 0$, $p(0,1) = qQ$, $p(0,0) = qP$ (8)

$H(x,y) = -\sum_{i,j} p(i,j) \log p(i,j) = -p \log p - qQ \log qQ - qP \log qP$ (9)

$H(x) = -\sum_i p(i) \log p(i) = -p \log p - q \log q$ (10)

$H(y) = -\sum_j p(j) \log p(j) = -(p + qQ) \log (p + qQ) - qP \log qP$ (11)

$R = H(x) + H(y) - H(x,y)$ (12)

$= -q \log q + qQ \log qQ - (p + qQ) \log (p + qQ)$ (13)

$R_k = kR$ (14)
Maximizing $R_k$ as a function of k:

$\frac{dR_k}{dk} = R + k \frac{dR}{dk} = 0$ (15)

Make use of:

$\frac{dP}{dk} = (1-p)^{k-1} \log_e (1-p)$ (16)

$\frac{dQ}{dk} = -(1-p)^{k-1} \log_e (1-p)$ (17)

$k \frac{dR}{dk} = kq \frac{dQ}{dk} [\log qQ - \log (p + qQ)]$ (18)

$= -k (1-p)^k \log_e (1-p) \log \frac{qQ}{p + qQ}$ (19)


Try $(1-p)^k = 1/2$ throughout. This is the same as $P = 1/2$ or $kN/F = \log_e 2$. (20)




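$R + k \frac{dR}{dk} = -(1-p) \log (1-p) + (\tfrac{1}{2} - p) \log (\tfrac{1}{2} - p) - \tfrac{1}{2} \log \tfrac{1}{2} + \tfrac{1}{2} \log_e 2 \, \log (1 - 2p)$ (21)


Expand for small p, term by term, with logarithms to the base 2 unless marked otherwise:


$R + k \frac{dR}{dk} = [p + p^2/(2 \log_e 2) + \ldots] + [-p - p^2 - \ldots]$ (22)

$\approx 0$ for $(1-p)^k = 1/2$ and small p. (23)


The first term is R, so:


$R \approx p = N/F$ (24)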
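
The conclusions of the Appendix can be checked numerically. The sketch below is an illustration only: it assumes the per-site rate $R = -q \log q + qQ \log qQ - (p + qQ) \log (p + qQ)$ with $P = (1-p)^{k-1}$ and logarithms to the base 2, searches the integers for the k that maximizes $kR$, and compares the result with $kN/F = \log_e 2$ and $R = N/F$. The function names and the value p = 0.001 are assumptions chosen for the example.

```python
import math

def xlog2(x):
    """x * log2(x), with the usual convention 0 * log 0 = 0."""
    return x * math.log2(x) if x > 0.0 else 0.0

def rate_per_site(p, k):
    """Per-site rate R when k descriptors are superimposed,
    each marking a given site with chance p."""
    q = 1.0 - p
    P = q ** (k - 1)   # chance that the other k-1 descriptors leave the site blank
    Q = 1.0 - P
    return -xlog2(q) + xlog2(q * Q) - xlog2(p + q * Q)

def best_descriptor_count(p, k_max):
    """Integer k in 1..k_max that maximizes the tally capacity R_k = k R."""
    return max(range(1, k_max + 1), key=lambda k: k * rate_per_site(p, k))

p = 0.001                                 # N/F, taken small (illustrative value)
k_opt = best_descriptor_count(p, 5000)
print(k_opt * p, math.log(2))             # k N/F should lie close to log_e 2
print(rate_per_site(p, k_opt) / p)        # R / (N/F) should lie close to 1
```

Because the stationarity condition holds only to first order in p, the maximizing k agrees with $(\log_e 2) F/N$ closely rather than exactly.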


References 


1 R. S. Casey & J. W. Perry, editors, "Punched Cards, Their Applications To Science and Industry," 
New York, Reinhold, 1951. 


2 C. N. Mooers, "Scientific Information Retrieval Systems For Machine Operation--Case Studies In
Design," Zator Technical Bulletin No. 66, April 1951.


3 C. N. Mooers, "Zatocoding Applied To Mechanical Organization Of Knowledge," American Documentation,
pp. 20-52, January 1951.


4 C. E. Shannon, "A Mathematical Theory Of Communication," Bell System Technical Journal, pp. 379-
423, July 1948; and pp. 623-656, October 1948.


5 British Patent No. 681,902 to C. N. Mooers (U.S. application Sept. 17, 1947); U.S. patents
pending.


6 W. D. White, "Theoretical Aspects of Asynchronous Multiplexing," Proc. I.R.E., pp. 270-275, v. 38,
March 1950.


7 J. R. Pierce & A. L. Hopper, "Nonsynchronous Time Division With Holding And With Random Sampling,"
Proc. I.R.E., v. 40, September 1952.


8 C. N. Mooers, "Zatocoding For Punched Cards," Zator Technical Bulletin No. 30, 1950.

9 C. N. Mooers, "Putting Probability To Work In Coding Punched Cards," presented at 112th meeting
of the American Chemical Society, New York, September 1947. Published as Zator Technical
Bulletin No. 10, 1947.


10 C. N. Mooers, "Application Of Random Codes To The Gathering of Statistical Information,"
master's thesis, Mathematics Department, M.I.T., February 1948.




MODERN STATISTICAL APPROACHES TO RECEPTION 


IN COMMUNICATION THEORY 


David Van Meter* 
David Middleton**†
Cruft Laboratory, Harvard University 
Cambridge, Massachusetts 


Summary 


When reception in the theory of communication is recognized as a problem in statistical infer- 
ence, system design and system analysis appear as the counterparts of designing and evaluating 
statistical tests. This paper discusses the optimum properties of designs based on statistical deci- 
sion theory from the risk point of view, and from that of information theory. Connections between 
risk and information loss are established, which result in a unified theory of system design. This 
includes Minimax methods capable in principle of handling all degrees of a priori knowledge of signal 
and noise statistics, new methods for comparing actual and ideal systems for the same purpose, and 
new interpretations of previously used formulations as special cases of the more general theory. 
Both detection and extraction of signals in noise are considered, the former as a problem of testing 
statistical hypotheses and the latter as one of estimating parameters. 


Formulation of the general reception problem as a decision operation is followed by a summary 
of statistical decision theory from the risk point of view, with some examples of Bayes and Minimax 
tests and optimum classes of decision rules. Applications to detection show the optimum nature of 
likelihood ratio receivers as a class, and indicate methods for defining the minimum detectable sig- 
nal and for comparing system performance. As an illustration, curves of Bayes and Minimax risk 
are given for detection of a pulsed carrier in noise. Applications to extraction show the nature of 
optimum extraction and the rôles of the mean square error and maximum likelihood criteria from
the more general point of view of risk theory. Conditions under which information loss is an ex- 
tremum in detection and extraction are established, and information loss itself as a criterion of per- 
formance is compared with that of the risk measure. 


1. Introduction and General Formulation 
1.0. Introduction 


The central rôle of the noise background in reception and the noise-like character of many types
of signal have naturally led to the application of various statistical ideas to the solution of reception
problems. Of chief practical interest are the twin problems of "best" or optimum methods of detecting
the presence or absence of a signal in noise, and of extracting a signal from a noisy background.
Each of these classes includes many variants, which depend mainly upon what is known about
the signal and noise and what is chosen as the criterion of "best." Perhaps earliest in point of time
is the view that the best extractor is a transducer which accepts the mixture of signal and noise at
its input and produces an output which is as close to the desired output as possible, the "distance"
between them being defined in some appropriate way, usually as the squared error averaged over an
interval equal to the time available for observation of the output. Wiener [1.1] originally treated the
problem in this way and found suitable linear filters for stationary inputs and semi-infinite observation
periods. Results have since been obtained for more general classes of filters and nonstationary
inputs by Booton [1.4], Davis [1.5], Singleton [1.2], and Zadeh and Ragazzini [1.3], among others.


The first treatments of detection were concerned with finding linear predetection filters which 
maximized the peak signal to rms noise ratio at the detector input, and hence at the detection output, 


*On leave from Pennsylvania State University, State College, Pennsylvania 
**Support for a small portion of this work was provided by the Department of the Army, the Department
of the Navy, and the Department of the Air Force, under Contract with the Massachusetts
Institute of Technology.
†Now at 49 Lexington Avenue, Cambridge.




when a monotonic relation between input and output is maintained. This marked a departure from 
the idea of controlling system errors, upon which the previously mentioned criterion for optimum 
extraction was based, and corresponds to a less complete view of the detection problem. The
familiar "matched filter" for white noise, treated by North [1.6] and by Van Vleck and Middleton [1.7],
was followed by the filters for "colored" noise of Den Hartog and Muller [1.8], Dwork [1.9], George 
[1.10], and Urkowitz [1.11]. Zadeh and Ragazzini [1.3] have considered problems of physical 
realizability and finite observation time which arise in this connection. Certain classes of non- 
linear predetection filters have also been discussed by Zadeh [1.12]. A second approach, which 
usually requires a higher order of a priori information than those based on the simpler "distance"
criteria, deals with detection and extraction as operations which are analogous to hypothesis testing
and parameter estimation in the theory of statistical inference (see, for example, Middleton [1.13]).
System design and analysis then appear as the counterparts of designing and evaluating statistical 
tests. Grenander [1.14] has shown how classical methods of inference, which assume discrete un- 
correlated samples, may be extended to stochastic processes. 


Siegert's introduction [1.15] of the Ideal Observer as a model of the human observer in the 
pulsed radar case was followed by the use of statistical criteria for actual system design by Hanse 
[1.16], Slattery, Reich and Swerling [1.17], and others. The betting curve, used by Siegert to de- 
fine the minimum detectable signal in a special instance was next employed by Middleton [1.18] in 
a unified presentation of several criteria appropriate for detection, including the sequential test. A 
somewhat different approach, used by Woodward and Davies [1.19] [1.20], viewed the receiver as 
an information processing device, and found that a receiver which presents the a posteriori prob- 
ability of the signal at its output (but which makes no "decision"), transmits the most information 
about the input signal. 


It is clear that there is an element of arbitrariness in whatever criterion of "optimum" is
chosen. The important thing is obviously to fit the system design to the constraints of the particular
problem at hand, with special attention to the amount and kind of statistical knowledge available and 
the type of result wanted. On the other hand, it is equally important not to impose unnecessary 
restrictions on the solution in order to obtain an easy answer without knowing the additional com- 
plexity or cost involved and the increase in performance to be obtained by lifting these restrictions. 
If investigation shows that certain statistical knowledge about signal or noise, not ordinarily avail- 
able, would result in substantial design improvement, then certainly the cost of obtaining such in- 
formation must be a factor in system planning. 


Systems designed to maximize signal-to-noise ratio, [1.6-1.12], or to present a posteriori 
probability distributions [1.19], [1.20] may certainly be justified as "optimum" from some point of
view, usually that the rôle of the reception system is to assist in making some specified judgment
about the signal. Our interest here, however, is in systems that themselves actually make optimum 
(or near optimum) judgments or decisions, as it seems to us that these are of greater potential im- 
portance in practice. 


In designing a detection system, for example, one is necessarily interested in the effects of 
the various errors which may occur and in the minimum detectable signal for given error performance.
In a binary (or single-alternative) system, when "yes-or-no" is the required decision,
there are two types of error [1.13, 1.18]: a Type I error, of mistaking noise for signal; and a
Type II error, of mistaking signal for noise. It may be that Type I errors (false alarms) are much
more important than Type II errors (false rests), e.g., an expensive sequence of operations may be
initiated by an alarm, and none by a rest. Consequently there must be provision in the design for
different relative weightings of the two types of error. Furthermore, the occurrence of errors depends
critically upon the a priori probabilities of the occurrence of signal and noise. Usually, reliable
estimates of these a priori probabilities are not easy to obtain, but a requirement of high-grade
performance in a given situation may warrant effort in this direction. Certainly their possible
effect must be taken into account in the analysis.
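
The interplay of error weightings and a priori probabilities may be illustrated by a small numerical sketch. It is not a system treated in this paper: it assumes a single observed sample, a known signal amplitude in unit-variance Gaussian noise, and a simple threshold decision, and all function and parameter names are hypothetical.

```python
import math

def gaussian_tail(x):
    """Probability that a unit-variance, zero-mean Gaussian variable exceeds x."""
    return 0.5 * math.erfc(x / math.sqrt(2.0))

def error_probabilities(threshold, amplitude):
    """Type I (false alarm) and Type II (false rest) probabilities when
    'signal' is decided whenever the sample exceeds the threshold."""
    alpha = gaussian_tail(threshold)                   # noise alone exceeds threshold
    beta = 1.0 - gaussian_tail(threshold - amplitude)  # signal plus noise falls below it
    return alpha, beta

def average_cost(threshold, amplitude, p_signal, cost_alarm, cost_rest):
    """Expected cost: each error type weighted by its cost and its prior probability."""
    alpha, beta = error_probabilities(threshold, amplitude)
    return (1.0 - p_signal) * cost_alarm * alpha + p_signal * cost_rest * beta

def best_threshold(amplitude, p_signal, cost_alarm, cost_rest, grid):
    """Threshold on the grid with the smallest expected cost."""
    return min(grid, key=lambda K: average_cost(K, amplitude, p_signal,
                                                cost_alarm, cost_rest))

grid = [i / 100.0 for i in range(-300, 501)]
print(best_threshold(2.0, 0.5, 1.0, 1.0, grid))    # equal weights
print(best_threshold(2.0, 0.5, 10.0, 1.0, grid))   # false alarms ten times as costly
```

Making false alarms more costly, or the signal less probable a priori, moves the preferred threshold upward, which is the kind of provision for relative weighting called for above.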


It is clear that a criterion of maximum signal-to-noise ratio at the output of a predetection 
filter is not by itself adequate when errors are important, since no mechanism for making decisions 
as to the presence or absence of a signal is considered. Neither does the (s/n)-approach tell us the 
effects of omitting or not optimally using the available information, or how such information should 
be best processed. If such a device is added to the system, the errors can be studied, but then the 
over-all system almost always incorporates an unnecessary constraint in the method of handling the 
data. It is more reasonable, and certainly it comes far closer to answering the question of finding 
the "best" system for a given class of decisions, to assume at the outset that the system must make
these decisions and then to design it accordingly, making best use of all available statistical data 
within the external constraints imposed by the nature of the application. If this optimum system is 




too expensive to realize, then compromises can be readily made, and evaluated by comparison with 
the optimum. [An example of this is given in Sec. (3.5)]. 


The present paper, it is believed, gives in large part a new approach to problems of reception 
in communication theory, an approach which adapts certain techniques recently developed by statisticians
to the design of detection and extraction systems. These methods here employ risk and information
loss as two possible criteria of performance. In the risk formulation a cost is preassigned
to each possible error the system can make, and the risk is calculated as the expected
value of the cost in view of the various error probabilities involved. The best system in this instance
is defined as one that minimizes this risk. In the formulation based on information loss, the
measure of information [2.4] is used to calculate system equivocation and the properties of systems
that minimize this quantity are investigated. The risk formulation has the advantage that with pre- 
assigned costs, it brings into the open the element of arbitrariness inherent in optimum criteria, 
allowing its effects to be studied. Special cost assumptions lead to previously used criteria such as 
the Neyman-Pearson and Ideal Observers [1.18] in detection, and the least mean-squared-error 
criterion in extraction [1.1,1.12]. Minimax methods for the design of tests when the available 
statistical data are incomplete, and methods for comparing actual and ideal systems complete our 
present theory of optimum system structurization. These provide techniques which appear to fit 
more closely than earlier efforts, the actual conditions under which practical solutions to detection 
and extraction problems are required. 


1.1. General Formulation 


We begin by considering the principal elements of a typical decision situation at a level of 
generality which does not restrict us to any specific type of signal, noise, or decision, and which does
not at this point involve special assumptions about the statistics of signal and noise. Figure 1 shows
such a general formulation, with the pertinent notation. A decision $\gamma$ is to be made about a signal $S$
based on observations $V$ of the mixture of signal and noise $V = S \otimes N$. In general, signal and noise
are functions of time, and observation is confined to a finite time interval (0,T). These observations
may consist of a discrete set of values of the variable in the interval (discrete or digital sampling),
or may include the continuum of its values throughout (0,T) (continuous or analog sampling). In
either case it is convenient to consider the "value" of the composite variable ($\mathbf{S}$, for example) as the
aggregate of the values of the corresponding physical quantity ($S$) at the appropriate instants in time
$t_1, \ldots, t_n$, e.g., $\mathbf{S} = (S_1, S_2, \ldots, S_n)$, and represent it as a point in a space of corresponding
dimensionality (n) in the usual fashion.


The occurrence of each value is assumed to be governed by a probability distribution function
defined over the space. These may be described by density functions if the space is continuous, and
if discrete, as the elementary probabilities associated with each value. Thus, if the signal space $\Omega$
contains two n-dimensional points only, $S_0$ and $S_1$, corresponding to known signals which occur with
probabilities q and p, the distribution function $\sigma(S)$ is defined as


$\sigma(S_0) = q$, $\sigma(S_1) = p$, $(q + p = 1)$. (1.1)


If S is a composite random variable, e.g., a noise wave where each of the components $S_i$ of $S$
possesses a continuous distribution with possible correlation between components, then
$\sigma(S) = \sigma(S_1, S_2, \ldots, S_n)$ is an n-fold joint distribution function and there are infinitely many points in the
space $\Omega$. Similar remarks hold for the variables $N$, $V$, $\gamma$ and their probability distributions $W(N)$,
$F_S(V)$ and $\delta(\gamma|V)$, respectively. The information about $\sigma(S)$ and $W(N)$ available in any particular
problem may not be complete, of course; we may know merely that they belong to certain classes
of distributions.


The decision $\gamma$ may be any statement concerning the member of the signal class $\Omega$ present at
the input. In the binary* detection problem, for instance, we test the hypothesis $H_0$: $S = 0$ (noise


*We use here and elsewhere (1.21) the term "binary" or "single alternative" to include all cases of 
detection where only one signal on any one observation (0,T) can be present. This particular signal, 
however, may be drawn from one or more sub-ensembles, representing possible distributions over 
the parameter describing the signals, e.g., amplitude, durations, or other structure factors. Thus 
the "simple alternative" refers to an ensemble containing only one possible signal, while the term 
"one-sided alternative" refers to a signal selected from an ensemble where there is more than one 
possibility. 




alone) against the alternative $H_1$: $S \neq 0$ (signal plus noise), and the decision space $\Delta$ contains two
points: $\gamma_0$: $S = 0$ is present at the input, and $\gamma_1$: $S \neq 0$ is present at the input. Here $\Omega$ may contain
only one other point besides $S = 0$ (simple alternative)*, or a continuum of points corresponding, say,
to different signal amplitudes (one-sided alternative)*. In extraction, $\gamma$ is an estimate of the value
or numerical measure of S present at the input, so that the dimensionality of the two is the same. 


The decision rule $\delta(\gamma|V)$ is the (conditional) probability that $\gamma$ will be decided when $V$ is given.
Ordinarily this probability is either one or zero for each $V$ and $\gamma$ (nonrandomized decision rule), although
the possibility of using a random mechanism to obtain $\gamma$ from $V$ is not excluded in the general
formulation. The essence of the decision situation is that $\delta(\gamma|V)$ is to be a rule for making the decision
$\gamma$ from the received or a posteriori data $V$ alone, i.e., independently of any a posteriori information
about $S$ [cf. (3.15)], although, of course, a priori knowledge of $S$ is necessarily "built into" the
decision rule. We indicate this symbolically by


$\delta(\gamma|V) = \delta(\gamma|V, S)$. (1.2)


The decision rule is the mathematical embodiment of the physical system to be optimized. 
Once the "best" decision rule is found for the problem at hand (consistent with an appropriate definition
of "best," which, in turn, means an appropriate choice of criterion) the system is designed to
perform the optimum operation thus revealed. Note that the decisions $\gamma$ are terminal decisions, so
that the decision rule may involve a sequence of intermediate "sub-decisions," as does, for example,
the standard sequential test of an hypothesis. 


The observation space $\Gamma$ contains points corresponding to all sample values of the process $V$.
$F_S(V)$ is the probability distribution function of the $V$'s for given $S$. If the form of the noise distribution
$W(N)$ is known, as we shall assume in this paper, $F_S(V)$ is uniquely specified by the value of $S$,
i.e., $F_S(V)$ is a parametric family of distribution functions. If the form of the noise distribution is
unknown, however, $F_S(V)$ becomes a nonparametric family. The theory to follow in its most general
form includes this case [1.22-1.24]. Moreover, to discuss specifically a wide class of practical
problems we shall further assume here that the mixture of signal and noise is additive ($V = S + N$) and
that signal and noise are statistically independent.** Thus,


$F_S(V) = W(V - S)$. (1.3)


The nature of the distributions $F_S(V)$ is of the utmost importance, of course, as it is upon these
that specific calculations of performance depend. They are simplified if the individual sample
values $V_1, V_2, \ldots, V_n$ (the "coordinates" of $V$) are statistically independent, since then $F_S(V)$ factors
into a product of identical one-dimensional distribution functions. When sampling is continuous and
$V(t)$ is band-limited (0, B) with $T \gg (2B)^{-1}$, the sampling theorem [1.19], [1.25] shows how $V(t)$ may
be expressed with good approximation in terms of a finite number of uncorrelated sample values.
Observation space may then be taken as a space with these coordinates instead of the original func- 
tion space of V(t). In many situations, however, the inherent correlations and the finiteness of the 
interval play a central role, and cannot be safely approximated away. 
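
As a minimal sketch of the additive, independent case, the following Python fragment evaluates the density of the data $V$ for a given signal $S$ as a product of one-dimensional noise densities evaluated at the differences $V_i - S_i$. The Gaussian form of the noise density, the unit variance, and the function names are assumptions made only for the illustration.

```python
import math

def noise_density(n, sigma=1.0):
    """One-dimensional noise density W; a zero-mean Gaussian is assumed here."""
    return math.exp(-0.5 * (n / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))

def data_density(V, S, sigma=1.0):
    """Density of the data V given the signal S for additive noise with
    statistically independent samples: the product of W(V_i - S_i)."""
    density = 1.0
    for v, s in zip(V, S):
        density *= noise_density(v - s, sigma)
    return density

# The density depends on V and S only through the difference V - S.
print(data_density([3.0, 0.5], [1.0, 1.0]))
print(data_density([2.0, -0.5], [0.0, 0.0]))
```

When the samples are correlated the product form no longer holds, and the joint noise density must be evaluated at the vector difference as a whole.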


More generally, $F_S(V)$ and $\Gamma$ may be replaced in Fig. 1 by a distribution $G_S(X)$ and a space X
whose coordinates are functionals of the continuous process $V(t)$, $0 \le t \le T$ [1.14, 1.27]. Thus, for
example, if $\{\phi_k\}$ is a complete orthonormal set of functions satisfying


$\phi_k(t) = \lambda_k \int_0^T R(s,t) \phi_k(s) \, ds$, (1.4)

where R(s,t) is the correlation function of the process, then the coordinates defined by

$x_k = \int_0^T V(t) \phi_k(t) \, dt$ (1.5)


are uncorrelated random variables. When $V(t)$ is a Gaussian process, $x_k$ is a Gaussian variable
whose first two moments are easily found, so that $G_S(X)$ is immediately forthcoming. When V(t) is

*See footnote previous page 
** While this simplifies the analysis in many cases, it in no way diminishes the scope of the general 
decision theory. 




not Gaussian, however, the difficulties of finding distributions of functionals like $x_k$ become formidable
[1.26]. Another and ultimately equivalent method of handling correlated samples is based on the
expansion of the distribution $W(V - S)$ about $S = 0$ [1.13]. This proves to be more useful for the
threshold cases of interest here.* (See Sec. 3).


Figure 1 emphasizes that decision rules are essentially transformations that map observation 
space into decision space. In detection each point of the observation space $\Gamma$ (or X) is mapped into
one or the other of the two points constituting the space of terminal decisions $\Delta$. In binary detection,
this is the same as dividing observation space into two regions, one corresponding to "no signal" and
the other to "signal plus noise," and carrying out the decision operation in one step, since only a single
alternative is involved. The binary detection problem is then the problem of how best to make this
division. Similarly, in extraction each point of observation space is mapped into a point of the space
of terminal decisions $\Delta$, which in this instance has the same structure as the signal space $\Omega$. If the
dimensionality of $\Delta$ is smaller than that of $\Gamma$ (as is usually the case in estimating a signal parameter)
the transformation is "irreversible," i.e., many points of $\Gamma$ go into a single point of $\Delta$. Thus extraction
may also be thought of as the division of $\Gamma$ into regions. Detection and extraction are closely
related, of course, since it is only necessary to group the points of $\Delta$ corresponding to $S \neq 0$ into a
single class labeled "signal and noise" to transform an extractor into a detector. Similarly, detection
systems are often essentially extractors followed by a threshold device that separates $S = 0$ from
$S \neq 0$. A system optimized for one function, however, may not necessarily be optimized for the other,
and in this sense we may therefore consider detection and extraction as separate problems for 
analysis. 


The scheme of Fig. 1 is clearly flexible enough to encompass a wide variety of reception problems.
The problem is characterized by the information available about $\sigma(S)$ and $W(N)$, the mode of
combination, $\otimes$, and the criterion chosen for the optimum decision rule. Here $F_S(V)$ (or $G_S(X)$) and
$\delta(\gamma|V)$ are derived quantities, the latter being the "answer" sought. In Section 2 we show how the
risk and information-loss criteria lead to classes of optimum decision rules. Sections 3 and 4 apply
these results to detection and extraction with some simple illustrative examples, while Section 5 discusses
connections between the risk and information criteria. A short review of some of the new
features of the present approach and its implications in statistical communication theory concludes the
paper.


2. Summary of Main Statistical Methods 


2.0. Introduction 


The theory of statistical decision functions was founded by Abraham Wald [2.1]. Since its in- 
ception the theory has been the subject of intensive research by many mathematical statisticians. A 
recent book by Blackwell and Girshick [2.2] gives the present status with an up-to-date bibliography. 


Our object here is to show how the concepts and results of this theory may be applied to practi- 
cal communication problems, and may be made to furnish a reasonable basis for system design. The 
concept of the loss function used to date in decision theory is now generalized to include criteria of 
information loss, as well as risk, in a single formulation.


2.1. Evaluation Functions 


The first requirement in a definition of optimum system performance is some kind of an 
evaluative scheme, or basis for saying that one system is better than another. We need a way of 
assigning an evaluation to each decision rule. In view of the statistical nature of the decision prob- 
lem, this evaluation should depend on the long-run or "ensemble" performance of the system. Let 
$F(S,\gamma)$ be a function which assigns a loss to each possible combination of signal and decision, and


*Although in the following discussion $\Gamma$, $V$ and $F_S(V)$ will be used throughout, it should be understood
that the results hold equally well with X, $X$ and $G_S(X)$ in their places, provided assumptions as to the
continuity, etc. of $F_S(V)$ are transferred to $G_S(X)$.




let $\mathcal{E}$ denote the operation of combining these to give the decision rule $\delta$ an over-all loss rating which
takes account of all possible modes of behavior and their relative frequencies of occurrence. In this
paper $\mathcal{E}$ will be taken as the expectation or average value of $F$ ($\mathcal{E} \to E$), since this leads to useful
interpretations of previous results as special cases of ours. The possibility of using other linear or
nonlinear operations for $\mathcal{E}$ should not be overlooked, however.


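The conditional loss rating of $\delta$ is then defined for given signals as:


$\mathcal{L}(S,\delta) = E_{V,\gamma|S} F(S,\gamma) = \int_\Gamma \int_\Delta F(S,\gamma) F_S(V) \delta(\gamma|V) \, dV \, d\gamma$. (2.1)

The average loss rating of $\delta$ takes account of the a priori signal distribution:

$\mathcal{L}(\sigma,\delta) = E_{S,V,\gamma} F(S,\gamma) = E_S \mathcal{L}(S,\delta) = \int_\Omega \int_\Gamma \int_\Delta F(S,\gamma) \sigma(S) F_S(V) \delta(\gamma|V) \, dS \, dV \, d\gamma$. (2.2)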

The loss function $F(S,\gamma)$ may or may not depend on how the decision is reached. In the following
we shall consider one of each type, corresponding to the risk and information loss criteria. In
risk theory a cost $C(S,\gamma)$ is preassigned to each combination of signal and decision independently of $\delta$:


$F = C(S,\gamma)$. (2.3)


The loss function for the information-loss criterion, however, is the "uncertainty" about $S$ when $\gamma$ is
known (or the "surprisal"), defined as:


$F = -\log P(S|\gamma)$, (2.4)


where $P(S|\gamma)$ is the a posteriori probability of $S$ given $\gamma$. This clearly depends not only on $S$ and $\gamma$,
but on the decision rule in use as well: it cannot be preassigned independently of $\delta$.


The conditional and average loss ratings of $\delta$ follow from (2.1) and (2.2):


Conditional risk: $r(S,\delta) = E_{V,\gamma|S} C(S,\gamma)$ (2.5)
Average risk: $R(\sigma,\delta) = E_{S,V,\gamma} C(S,\gamma) = E_S r(S,\delta)$ (2.6)
Conditional information loss: $h(S,\delta) = -E_{V,\gamma|S} \log P(S|\gamma)$ (2.7)
Average information loss: $H(\sigma,\delta) = -E_{S,V,\gamma} \log P(S|\gamma) = E_S h(S,\delta)$. (2.8)


The last of these is the well known "equivocation" in information theory [2.4].
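
For a finite problem the four quantities above reduce to sums and may be computed directly. The sketch below is illustrative only: signals, observations, and decisions are each taken to be a small finite set, distributions are given as nested lists of probabilities, logarithms are to the base 2, and every name and number is an assumption of the example.

```python
import math

def conditional_risk(S, F, delta, C):
    """r(S, delta): sum over V and gamma of C(S, gamma) F_S(V) delta(gamma|V)."""
    return sum(C[S][g] * F[S][V] * delta[V][g]
               for V in range(len(F[S])) for g in range(len(C[S])))

def average_risk(sigma, F, delta, C):
    """R(sigma, delta): conditional risk averaged over the a priori distribution."""
    return sum(sigma[S] * conditional_risk(S, F, delta, C)
               for S in range(len(sigma)))

def equivocation(sigma, F, delta):
    """H(sigma, delta): average of -log2 P(S|gamma) over S and gamma."""
    n_dec = len(delta[0])
    joint = [[sigma[S] * sum(F[S][V] * delta[V][g] for V in range(len(F[S])))
              for g in range(n_dec)] for S in range(len(sigma))]
    H = 0.0
    for g in range(n_dec):
        p_gamma = sum(joint[S][g] for S in range(len(sigma)))
        for S in range(len(sigma)):
            if joint[S][g] > 0.0:
                H -= joint[S][g] * math.log2(joint[S][g] / p_gamma)
    return H

sigma = [0.5, 0.5]                  # a priori probabilities of "noise" and "signal"
F = [[0.9, 0.1], [0.2, 0.8]]        # F[S][V]: distribution of the data for each S
delta = [[1.0, 0.0], [0.0, 1.0]]    # delta[V][gamma]: decide gamma = V
C = [[0.0, 1.0], [1.0, 0.0]]        # C[S][gamma]: unit cost for either error
print(average_risk(sigma, F, delta, C), equivocation(sigma, F, delta))
```

The cost table enters only the risk; the equivocation is fixed by the a priori distribution, the data distributions, and the decision rule, which is the distinction drawn above between the two kinds of loss function.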


2.2. Minimax and Bayes Criteria. 


A central feature of the present theory is that the question of a priori probabilities*, avoided
for the most part in classical statistical testing procedures and the subject of much controversy when
considered [2.5], is here squarely faced, as indeed it must be for satisfactory solution of the practical
problems of reception. These prior probabilities are often not known and often cannot be obtained
(they may not even exist in a satisfactory philosophical sense). In that case a basis for design
is provided by the Minimax criterion:


A Minimax decision rule $\delta_M$ is one whose maximum
conditional loss rating (over all signal values) is not
greater than the maximum conditional loss rating of
any other decision rule $\delta$:


$\max_S \mathcal{L}(S,\delta_M) \le \max_S \mathcal{L}(S,\delta)$, all $\delta$. (2.9)


The Minimax principle may be criticized as being too conservative. When the decision situation is 
regarded as a game [2.6] between Nature and the statistician, for example, we observe that Minimax 




strategy on the part of the latter is reasonable only if he assumes that Nature deliberately chooses 
for her strategy the one least favorable to him. Although this is not likely, there is perhaps some 
justification for basing a theory of the best choice of § on this assumption, when nothing whatever is 
known about Nature's strategy. Even if the analogy with game theory is incomplete at this point, it 
nevertheless remains valuable, since it led Wald to the complete class theorem (Sec. 2.4). 


When the prior (e.g., signal) probabilities $\sigma(S)$ are completely known, the Bayes criterion is
the basis for design:


A Bayes decision rule $\delta_B$ is one whose average loss
rating is smallest for a given a priori distribution $\sigma(S)$:


$\mathcal{L}(\sigma,\delta_B) \le \mathcal{L}(\sigma,\delta)$, all $\delta$. (2.10)


The Bayes principle makes the fullest use of prior (signal) probabilities; in a sense it assumes the 
most favorable case. The two principles are appropriate in the extreme cases of no knowledge and 
full knowledge of these probabilities. For intermediate situations, where $\sigma(S)$ is partially known,
see the discussion of Hodges and Lehmann [2.7]. We emphasize that when $W(N)$ is not known the
Minimax principle can be used to supply a "least favorable" $W(N)$ and corresponding Bayes test, just
as it may be to handle incomplete knowledge of a priori signal probabilities.
In either situation Minimax procedures provide the necessary probability distribution with which
to construct Bayes tests. See Sec. 3.4 following for an illustration.
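
The two criteria can be compared on a small finite problem by direct enumeration. The sketch below is illustrative and makes a simplifying assumption: only nonrandomized rules are searched, whereas a true Minimax rule may in general require randomization. The risk formulation is used for the loss rating, and all names and numbers are hypothetical.

```python
from itertools import product

def conditional_risks(rule, F, C):
    """Conditional risk for each signal value S, where rule[V] is the decision
    taken on observing V (a nonrandomized decision rule)."""
    return [sum(F[S][V] * C[S][rule[V]] for V in range(len(rule)))
            for S in range(len(F))]

def all_rules(n_obs, n_dec):
    """Every nonrandomized rule: one decision for each possible observation."""
    return list(product(range(n_dec), repeat=n_obs))

def bayes_rule(sigma, F, C, n_dec):
    """Rule with the smallest average risk for the a priori distribution sigma."""
    return min(all_rules(len(F[0]), n_dec),
               key=lambda d: sum(s * r for s, r in zip(sigma, conditional_risks(d, F, C))))

def minimax_rule(F, C, n_dec):
    """Rule whose largest conditional risk is smallest (nonrandomized rules only)."""
    return min(all_rules(len(F[0]), n_dec),
               key=lambda d: max(conditional_risks(d, F, C)))

F = [[0.9, 0.1], [0.2, 0.8]]     # F[S][V]: data distribution for each signal value
C = [[0.0, 1.0], [1.0, 0.0]]     # C[S][gamma]: unit cost for either kind of error
print(bayes_rule([0.5, 0.5], F, C, 2))     # equal prior probabilities
print(bayes_rule([0.99, 0.01], F, C, 2))   # signal very improbable a priori
print(minimax_rule(F, C, 2))               # no prior knowledge assumed
```

With a strongly one-sided prior the Bayes rule ignores the data and always decides "no signal," while the Minimax rule, which makes no use of the prior, continues to follow the data.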


2.3. Comparison of Decision Rules 


The conditional loss rating of a decision rule depends, of course, on the particular signal 
present at the input. One decision rule may have a smaller rating than another for some signals, 
and a larger one for others. If the conditional loss rating of $\delta_1$ never exceeds that of $\delta_2$ for any
value of $S$, and is actually less than that of $\delta_2$ for some particular $S$, $\delta_1$ is said to be uniformly better
than $\delta_2$. This leads to the definition of an admissible decision rule:


A decision rule is admissible if no uniformly better 
one exists. 


Note particularly that on this definition an admissible rule is not necessarily uniformly better than 
any other. Other rules can have smaller ratings at particular values of S. The point is that they 
cannot be better at all values of S. 
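When the conditional loss ratings are tabulated at a finite set of signal values, the two definitions above can be checked mechanically. The following is a minimal sketch in Python; the three rules and their ratings are hypothetical numbers chosen only for illustration, and the function names are assumptions rather than anything defined in the paper.

```python
from typing import Dict, List, Sequence


def uniformly_better(r1: Sequence[float], r2: Sequence[float]) -> bool:
    """True if rule 1 never has a larger conditional rating than rule 2
    and has a strictly smaller one for at least one signal value."""
    never_worse = all(a <= b for a, b in zip(r1, r2))
    somewhere_better = any(a < b for a, b in zip(r1, r2))
    return never_worse and somewhere_better


def admissible_rules(ratings: Dict[str, Sequence[float]]) -> List[str]:
    """Names of the rules for which no uniformly better rule exists in the table."""
    return [
        name
        for name, r in ratings.items()
        if not any(
            uniformly_better(other, r)
            for other_name, other in ratings.items()
            if other_name != name
        )
    ]


# Hypothetical conditional loss ratings at three signal values.
RATINGS = {
    "rule_a": [1.0, 4.0, 2.0],
    "rule_b": [2.0, 3.0, 2.0],  # smaller than rule_a at the second signal only
    "rule_c": [2.0, 4.0, 3.0],  # rule_b is uniformly better than rule_c
}

print(admissible_rules(RATINGS))
```

Here rule_a and rule_b are both admissible although neither is uniformly better than the other, which is the point made in the note following the definition.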


It follows from these definitions that if a Bayes or Minimax rule is unique it is admissible. The 
converse is not true, however; an admissible rule is not necessarily Bayes or Minimax. Thus, no 
system that does not minimize the average loss rating can be uniformly better than a Bayes system 
(for the same σ), and no system that does not minimize the maximum conditional loss rating can be
uniformly better than a Minimax system. Admissibility is an important additional optimum property 
of Bayes and Minimax decision systems. 


It is clear from the definitions also that a Bayes decision rule whose conditional loss rating is
constant is a Minimax decision rule. In many cases this furnishes a useful way of finding the
Minimax rule.


2.4. The Complete Class Theorem 


A class D of decision rules is complete if for any δ not in D we can find a δ* in D such that δ*
is uniformly better than δ. If D contains no sub-class which is complete it is a minimal complete
class.


Wald has shown [2.8] that in the risk formulation [F = C(S,γ)], the class of all admissible
Bayes decision rules (i.e., for different σ's) is a minimal complete class [2.9] under the following
assumptions:

(A) F_S(V) is continuous in S,
(B) C(S,γ) is bounded in S and γ,
(C) The class of decision rules considered is restricted to either                    (2.11)
    (i) nonsequential rules, or (ii) sequential rules,
(D) S and γ are restricted to finite closed domains.


On these assumptions also, any Minimax decision rule can be shown to be a Bayes rule with respect
to a certain σ(S), called the least favorable a priori distribution, and the existence of such a distribu-
tion as well as the existence of Bayes and Minimax rules themselves is assured [2.11].


The complete class theorem thus establishes an optimum property of the Bayes class as a 
whole. For instance, we shall see in Sec. 3 that the Bayes test for binary detection is a likelihood 
ratio test. The complete class theorem applied here says that corresponding to any non-likelihood 
receiver (for example, most existing detection systems) there is a likelihood receiver which is 
uniformly better. The physical embodiment of such a receiver comprises a computer of the likelihood
ratio and a threshold comparison device. The computer design is the same for all Bayes tests,
so that the optimum property attributed to the Bayes class as a whole by the complete class theorem 
is realized in the computer design. 


No complete class theorem for the information loss formulation has been proved as yet. Some 
results on the characterization of Bayes tests with this measure used in hypothesis testing and 
extraction are given in Sec. 5.


2.5. Information and Sufficiency. 


Here we wish to make clear the connection between information loss and sufficient statistics.
A statistic of the distribution F_S(V) = F(V|S) may be defined as any transformation of the observation
points such as γ(V). An estimator of S is a statistic formed by mapping the points of Γ onto a decision
space like Ω. The maximum likelihood estimator of S, for example, is constructed by giving to
the estimator γ, for each V, the value of S that maximizes F(V|S).


The classical concept of sufficiency is based on the view that the parameters S governing the
distribution exert a "causal" influence on the value of the variate V which is more or less obscured
by the "randomness" of the distribution. A statistic γ(V) is said to be sufficient, roughly speaking,
if knowledge of γ is as good as knowledge of V itself as far as determining the S "responsible" for
both is concerned. From (1.2), we write

$$\delta(\gamma \mid V) = \delta(\gamma \mid V, S).$$

A little algebraic rearranging gives:

$$\sigma(S)\, F(V \mid S) = \sigma(S)\, P_3(\gamma \mid S)\, \nu(V \mid S, \gamma),$$    (2.12)

where P_3(γ|S) is the conditional probability (density) of γ with S fixed, and ν(V|S,γ) that of V with S
and γ fixed. The only kind of knowledge about S we get by knowing V is contained in the dependence
of F(V|S) on S, V fixed, namely, the likelihood function. If we know γ, but not V, we can reproduce
this dependence (except for an unknown constant scale factor) from a knowledge of γ alone, only in
case ν(V|S,γ) is independent of S, i.e., if

$$\nu(V \mid S, \gamma) = \nu(V \mid \gamma).$$    (2.13)

With this, (2.12) becomes equivalent to

$$P_1(S \mid V) = P_2(S \mid \gamma),$$    (2.14)

where P_1(S|V) and P_2(S|γ) are the conditional probability densities of S with V and γ respectively
fixed. Either (2.13) or (2.14) may be taken as the definition of sufficiency. When γ is a sufficient
statistic, specification of V in addition to γ does not in any way improve our knowledge of S. The
relation (2.12) shows that only distributions that can be factored into a product of two terms, such
that one involves γ and S only and the other V and γ only, admit a sufficient statistic. The notion of
sufficiency and the factorization condition were introduced by R. A. Fisher [2.12]. A recent treatment
has been given by Halmos and Savage [2.13].


Closely related to sufficiency is the idea, due also to Fisher [2.14], of associating with an
observation a numerical measure of the information it contains about the distribution parameter.
The Shannon information measure serves the same purpose and is more appropriate for communica-
tion problems. We define the loss of information about a particular S attending formation of the
statistic or estimator γ, from a particular observation V, as the difference of the uncertainties, or

$$\log \frac{P_1(S \mid V)}{P_2(S \mid \gamma)}.$$    (2.15)

The expected value of this, or the average information loss, is

$$H(\sigma, \delta) = \int_{\Gamma} dV\, f(V) \int_{\Delta} d\gamma\, \delta(\gamma \mid V) \int_{\Omega} P_1(S \mid V)\, \log \frac{P_1(S \mid V)}{P_2(S \mid \gamma)}\, dS,$$    (2.16)

where

$$f(V) = \int_{\Omega} \sigma(S)\, F(V \mid S)\, dS.$$    (2.17)

The last integral in (2.16) is always positive or zero, and is zero if and only if (2.14) is satisfied,
i.e., if γ is a sufficient statistic [2.15, 2.16].
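The statement above can be illustrated numerically on a small discrete problem. The sketch below is an assumption-laden toy case, not taken from the paper: S takes two values with equal prior probability, V consists of two independent binary observations whose success probability depends on S, and the names `PRIOR`, `THETA`, and `average_information_loss` are hypothetical. It evaluates the discrete analogue of (2.16) for a nonrandomized statistic γ(V).

```python
import math
from itertools import product

PRIOR = {0: 0.5, 1: 0.5}   # sigma(S), assumed for the example
THETA = {0: 0.3, 1: 0.7}   # probability that a single observation equals 1, given S


def likelihood(v, s):
    """F(V|S) for independent binary observations."""
    return math.prod(THETA[s] if bit else 1.0 - THETA[s] for bit in v)


def average_information_loss(statistic, n=2):
    """Discrete analogue of (2.16) for the statistic gamma = statistic(V)."""
    observations = list(product((0, 1), repeat=n))
    joint = {(s, v): PRIOR[s] * likelihood(v, s) for s in PRIOR for v in observations}
    f = {v: sum(joint[s, v] for s in PRIOR) for v in observations}
    # Joint probability of S and gamma, obtained by summing over all V with the same gamma.
    joint_gamma = {}
    for (s, v), prob in joint.items():
        key = (s, statistic(v))
        joint_gamma[key] = joint_gamma.get(key, 0.0) + prob
    loss = 0.0
    for v in observations:
        g = statistic(v)
        prob_gamma = sum(joint_gamma[s, g] for s in PRIOR)
        for s in PRIOR:
            p1 = joint[s, v] / f[v]               # P_1(S|V)
            p2 = joint_gamma[s, g] / prob_gamma   # P_2(S|gamma)
            loss += f[v] * p1 * math.log(p1 / p2)
    return loss


loss_sum = average_information_loss(sum)                # number of ones: sufficient
loss_first = average_information_loss(lambda v: v[0])   # first observation only
print(loss_sum, loss_first)
```

The count of ones is a sufficient statistic here, so its loss vanishes to rounding error, while discarding the second observation gives a strictly positive loss (in natural units, since natural logarithms are used).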


3. Application to Detection 


3.0. Introduction 


In this section we consider detection mainly from the risk point of view, and examine explicitly 
the binary, or single-alternative cases. Some results of the information-theory approach are given 
in Section 5. 


The binary detection problem [1.21], [1.25] is the problem of testing the hypothesis H_0: noise
alone present at the input, against the alternative H_1: signal plus noise present, when there are only
two points in decision space, namely, γ_0: the decision that noise alone occurs, and γ_1: the decision
that signal plus noise occurs; γ_0 and γ_1 are decided with probabilities δ(γ_0|V) and δ(γ_1|V)
respectively, and

$$\delta(\gamma_0 \mid V) + \delta(\gamma_1 \mid V) = 1.$$    (3.1)

A cost is preassigned to each possible combination of signal and decision as follows:

$$C(S = 0, \gamma_0) = C_{1-\alpha}, \qquad C(S = 0, \gamma_1) = C_{\alpha},$$
$$C(S \neq 0, \gamma_0) = C_{\beta}, \qquad C(S \neq 0, \gamma_1) = C_{1-\beta},$$    (3.2)

where C_α and C_β are the costs of the Type I and Type II errors mentioned previously, and C_{1-α} and
C_{1-β} are the costs of correct decisions. These are all finite positive quantities with

$$C_{\alpha} > C_{1-\alpha}, \qquad C_{\beta} > C_{1-\beta},$$    (3.3)

as required by the nature of the problem. We also specify that W(V-S) is continuous in S, and
consider nonsequential tests only, thus fulfilling all of the assumptions of Sec. 2.4.


We next consider the signal space Ω to contain infinitely many points besides S = 0, with σ(S)
taken for convenience as [1.21]

$$\sigma(S) = q\, \delta(S - 0) + w(S), \qquad w(0) = 0, \qquad p = \int_{\Omega} w(S)\, dS, \qquad p + q = 1.$$    (3.4)

Thus the test is against a one-sided alternative, the simple alternative test appearing as the special
case

$$w(S) = p\, \delta(S - S_1),$$    (3.5)

where S_1 is the single possible signal. (In (3.4) and (3.5) δ is the Dirac delta function.)


Note that no specific assumption is made as to the way different signals are distinguished in Ω.
They may all be of the same form with different amplitudes or epochs or both, or they may differ in
other respects. This detail is reflected in the expression used for w(S), which is a one-dimensional
distribution if amplitude alone varies, two-dimensional if both amplitude and epoch vary, n-dimensional
if the values of n physical quantities are used to specify the signal completely in the interval (0,T),
etc. The following discussion includes all of these.


3.1. Characterization of Bayes Detection Systems. 


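To characterize the class of Bayes tests, we form the expression for the average risk and minimize
by a choice of δ(γ_0|V) or δ(γ_1|V). The conditional risk depends on whether S = 0 or S ≠ 0 is
present at the input. From (2.5) we write

$$r(S, \delta) = \int_{\Gamma} \int_{\Delta} C(S, \gamma)\, W(V - S)\, \delta(\gamma \mid V)\, dV\, d\gamma
= \begin{cases} \alpha C_{\alpha} + (1 - \alpha)\, C_{1-\alpha}, & S = 0, \\ \beta C_{\beta} + (1 - \beta)\, C_{1-\beta}, & S \neq 0, \end{cases}$$    (3.6)

where α and β are the (conditional) probabilities of Type I and Type II errors respectively:

$$\alpha = \int_{\Gamma} W(V)\, \delta(\gamma_1 \mid V)\, dV, \qquad \beta = \int_{\Gamma} W(V - S)\, \delta(\gamma_0 \mid V)\, dV.$$    (3.7)

The average risk takes account of σ(S), given by (3.4). From (2.6) and (3.1) we then have

$$R(\sigma, \delta) = q \alpha C_{\alpha} + q (1 - \alpha)\, C_{1-\alpha} + \beta' C_{\beta} + (p - \beta')\, C_{1-\beta}$$    (3.8)

$$= \int_{\Gamma} \Big\{ q C_{\alpha} W(V) + C_{1-\beta} \langle W(V - S) \rangle + \delta(\gamma_0 \mid V) \big[ \langle W(V - S) \rangle (C_{\beta} - C_{1-\beta}) - q W(V) (C_{\alpha} - C_{1-\alpha}) \big] \Big\}\, dV,$$    (3.9)

where now

$$\langle W(V - S) \rangle = \int_{\Omega} W(V - S)\, w(S)\, dS,$$    (3.10)

$$\beta' = \int_{\Gamma} \langle W(V - S) \rangle\, \delta(\gamma_0 \mid V)\, dV.$$    (3.11)

Note that in the simple alternative case (3.5)

$$\langle W(V - S) \rangle \to p\, W(V - S_1), \qquad \beta' \to p \beta.$$    (3.12)

The condition for minimum R(σ,δ) = R(σ,δ*) is now evident from (3.9): the bracket multiplying
δ(γ_0|V) is negative exactly where deciding γ_0 lowers the risk. Let:

$$\Lambda = \frac{\langle W(V - S) \rangle}{q\, W(V)},$$    (3.13)

$$K = \frac{C_{\alpha} - C_{1-\alpha}}{C_{\beta} - C_{1-\beta}}.$$    (3.14)

Then the Bayes decision rule is:

Let δ(γ_0|V) = 0 (i.e., decide γ_1) when Λ ≥ K,
Let δ(γ_0|V) = 1 (i.e., decide γ_0) when Λ < K.                    (3.15)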


The Bayes decision rule therefore divides the observation space Γ into two regions separated by the
V's satisfying Λ = the threshold K, viz., the acceptance region Γ' containing V's for which Λ < K,
and the critical region Γ'' in which Λ > K. Note that the division is different for different σ's. The
Bayes class of decision rules includes all rules of the type (3.15) corresponding to different a priori
signal and noise probabilities.


Since (3.13) is a generalization of the likelihood ratio* this means that Bayes decision rules
are nonrandom likelihood ratio tests with a threshold K depending on the preassigned costs. If a
Bayes test is unique, it is admissible, which means here (according to (3.6)) that both of the error
probabilities α and β cannot be smaller for any nonlikelihood test than they are for a likelihood ratio
test. The complete class theorem then says that given any nonlikelihood ratio test one can always
find a likelihood ratio test for which at least one of the error probabilities is less than, and the other
is not greater than, those of the nonlikelihood test.


The likelihood ratio test takes on added significance when the Bayes risk (3.9) is rewritten as

$$R(\sigma, \delta) = \int_{\Gamma} f(V) \big[ \rho_{\gamma_1}(V)\, \delta(\gamma_1 \mid V) + \rho_{\gamma_0}(V)\, \delta(\gamma_0 \mid V) \big]\, dV,$$    (3.16)

where

$$f(V) = \int_{\Omega} \sigma(S)\, W(V - S)\, dS = q W(V) + \langle W(V - S) \rangle,$$    (3.17a)

$$\rho_{\gamma_1}(V) = C_{\alpha}\, P_1(S = 0 \mid V) + C_{1-\beta}\, P_1(S \neq 0 \mid V),$$    (3.17b)

$$\rho_{\gamma_0}(V) = C_{\beta}\, P_1(S \neq 0 \mid V) + C_{1-\alpha}\, P_1(S = 0 \mid V),$$    (3.17c)

and where we have used the relations

$$q W(V) = f(V)\, P_1(S = 0 \mid V),$$    (3.18a)

$$\langle W(V - S) \rangle = f(V)\, P_1(S \neq 0 \mid V).$$    (3.18b)

Here f(V) is the total probability of V's occurrence, and P_1(S=0|V) the posterior probability of S = 0
when V is given, etc. Thus, for example, (3.18a) gives alternative expressions for the joint prob-
ability of S = 0 and V. The quantities ρ_{γ_0}(V) and ρ_{γ_1}(V) may be interpreted as the a posteriori risks
of making decisions γ_0 and γ_1, respectively. The decision rule (3.15) which minimizes the Bayes
risk thus also makes the decision for which the a posteriori risk is least.
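The equivalence between the threshold rule (3.15) and the choice of least a posteriori risk can be checked on a simple case. The sketch below assumes a single scalar observation, unit-variance Gaussian noise, and one possible signal value S_1 (the simple alternative (3.5)); the cost values, the prior p, and all function names are hypothetical choices made only for the illustration.

```python
import math

# Hypothetical costs: C_alpha, C_(1-alpha), C_beta, C_(1-beta).
COSTS = {"alpha": 4.0, "one_minus_alpha": 0.0, "beta": 2.0, "one_minus_beta": 0.0}


def noise_pdf(v):
    """W(V): unit-variance, zero-mean Gaussian noise (an assumption of this sketch)."""
    return math.exp(-0.5 * v * v) / math.sqrt(2.0 * math.pi)


def threshold(costs):
    """K of (3.14)."""
    return (costs["alpha"] - costs["one_minus_alpha"]) / (costs["beta"] - costs["one_minus_beta"])


def likelihood_ratio(v, s1, p):
    """Generalized likelihood ratio (3.13) in the simple alternative case."""
    q = 1.0 - p
    return p * noise_pdf(v - s1) / (q * noise_pdf(v))


def decide_by_ratio(v, s1, p, costs):
    """Rule (3.15): 1 means 'signal plus noise', 0 means 'noise alone'."""
    return 1 if likelihood_ratio(v, s1, p) >= threshold(costs) else 0


def posterior_risks(v, s1, p, costs):
    """A posteriori risks of deciding gamma_0 and gamma_1, as in (3.17b), (3.17c)."""
    joint0 = (1.0 - p) * noise_pdf(v)   # q W(V)
    joint1 = p * noise_pdf(v - s1)      # p W(V - S_1)
    f = joint0 + joint1
    post0, post1 = joint0 / f, joint1 / f
    risk_gamma0 = costs["beta"] * post1 + costs["one_minus_alpha"] * post0
    risk_gamma1 = costs["alpha"] * post0 + costs["one_minus_beta"] * post1
    return risk_gamma0, risk_gamma1


def decide_by_posterior_risk(v, s1, p, costs):
    risk_gamma0, risk_gamma1 = posterior_risks(v, s1, p, costs)
    return 1 if risk_gamma1 <= risk_gamma0 else 0


S1, P = 1.5, 0.3
GRID = [-5.0 + 0.01 * i for i in range(1001)]
agree = all(
    decide_by_ratio(v, S1, P, COSTS) == decide_by_posterior_risk(v, S1, P, COSTS)
    for v in GRID
)
# For this Gaussian case the boundary Lambda = K is a single point on the V axis.
boundary = S1 / 2.0 + math.log(threshold(COSTS) * (1.0 - P) / P) / S1
print(agree, boundary)
```

For these assumed values the two procedures agree at every grid point, and the acceptance and critical regions are the half-lines on either side of the printed boundary.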


Similarly, if we note that the logarithm of the likelihood ratio is proportional to the Shannon
measure of the difference between the uncertainties about H_0 and H_1 when V is known:

$$\log \Lambda = \log \frac{\langle W(V - S) \rangle}{q\, W(V)} = \log \frac{P_1(S \neq 0 \mid V)}{P_1(S = 0 \mid V)},$$    (3.19)

we observe that the decision rule (3.15) amounts to deciding in favor of H_1 when the uncertainty
about H_1 is less than the uncertainty about H_0 by an amount log K. When there is only one signal
(simple alternative case), which is distinguished from zero signal by an amplitude scale factor a_0,
it is easy to show that the average of the uncertainty difference (i.e., the information difference) is
proportional [3.1] to a_0^2, when a_0 is small.

3.2. The Neyman-Pearson and Ideal Observers [1.13], [1.18], [1.21] 


The classical Neyman-Pearson test of an hypothesis against a single alternative fixes the
probability α and minimizes the probability β. This is accomplished by a likelihood ratio test with a
certain threshold which depends on α. To show this, one may use expressions (3.7), minimizing β by
variation of the boundary between the acceptance and critical regions, subject to a constraint of fixed
α [1.18]. Since the Neyman-Pearson test is a likelihood ratio test, it may be interpreted from the
risk point of view as a Bayes test for certain cost assumptions. Fixing α and minimizing β is thus
equivalent to assuming a certain ratio of costs (depending on α, i.e., K = K(α)), and minimizing the
average risk. The smaller the allowed α (false alarm probability) the higher the threshold required;
or in risk terms: the smaller α is the larger must be the ratio of costs preassigned to false alarms
and false rests.


*Here and elsewhere we use "likelihood ratio" and Λ to refer to the ratio of the joint probability of
S ≠ 0 and V to that of S = 0 and V. This differs from the classical use of the term (as the ratio of the
corresponding conditional probabilities) in connection with tests not concerned with the prior prob-
abilities of the hypotheses. Note that with our definition Λ is essentially a ratio of a posteriori
probabilities. (See Eq. 3.19).


Another way of designing a single alternative test is to require that the total probability of error
(qα + pβ) be minimized. An observer who makes decisions in this way is called an Ideal Observer. In
a way similar to that used for the Neyman-Pearson test, this also may be set up as a variational prob-
lem (with no constraint) resulting in a likelihood ratio test with K = 1 [1.18]. Thus it may be thought
of as a Bayes test with a cost ratio of unity.


The fact that these are likelihood ratio tests follows from the optimum performance they re- 
quire; since they are likelihood ratio tests, they belong to the Bayes class from the risk point of 
view, and therefore share the general optimum properties possessed by that class. 


3.3. Decision Curves 


The Bayes risk itself may be taken as a figure of merit for optimum system performance. In
the simple alternative case, when it is known that there is only one other signal besides S = 0 in Ω,
and it is characterized by a fixed amplitude scale factor a_0, one can define the minimum detectable
signal (amplitude) in a way analogous to the betting curve procedure introduced by Siegert. The
Bayes risk for detection depends on the signal amplitude, naturally being less for larger amplitudes.
Thus, a functional relation between these quantities may be used to find the smallest signal amplitude
for which the risk does not exceed a certain value (assumed beforehand as a criterion for the mini-
mum detectable signal). Here α and β may be calculated from the formulas [1.18, 1.21]:

$$\alpha = 1 - \int_{-\infty}^{\log K} dy \int_{\Gamma} W(V)\, \delta\big(y - \log \Lambda(V)\big)\, dV,$$    (3.20)

$$\beta = 1 - \int_{\log K}^{\infty} dy \int_{\Gamma} W(V - a_0 S)\, \delta\big(y - \log \Lambda(V)\big)\, dV,$$    (3.21)

where S is now normalized to place the amplitude factor a_0 in evidence [1.13], and δ is the Dirac
delta function. Here Λ(V), or any monotonic function of Λ(V), may be used in place of log Λ(V) for
convenience and K is simply a threshold value.


Figure 2 shows normalized Bayes risk curves of this nature, calculated for Rayleigh statistics.
Observation space in this example is the space of all values of the envelope of the mixture of signal
and noise, or noise alone (See Sec. 1.1). Thus, W(V) and W(V - a_0 S) in (3.20) and (3.21) are replaced
by the corresponding Rayleigh distribution functions for noise alone and signal plus noise. It is as-
sumed that n observations of the envelope are made in (0,T) at a repetition period large compared
with the correlation time of the input, so that the samples are uncorrelated [3.2]. For the common
cases of small a_0 and large n, (3.20) and (3.21) yield approximately:

$$\alpha \simeq \frac{1}{2} \left\{ 1 - \Theta\left[ \frac{a_0^2 \sqrt{n}}{2\sqrt{2}} + \frac{\log (K/\mu)}{a_0^2 \sqrt{2n}} \right] \right\},$$    (3.22)

$$\beta \simeq \frac{1}{2} \left\{ 1 - \Theta\left[ \frac{a_0^2 \sqrt{n}}{2\sqrt{2}} - \frac{\log (K/\mu)}{a_0^2 \sqrt{2n}} \right] \right\},$$    (3.23)

where μ = p/q. Here

$$\Theta(z) = \frac{2}{\sqrt{\pi}} \int_0^z e^{-t^2}\, dt,$$

as usual.


Figure 3 summarizes these relations for α + β < 1. For a_0^2 √(2n) fixed, the values of α and β are
interchanged by reciprocating K/μ. Both α and β are decreased when a_0^2 √(2n) is increased with
K/μ fixed. Specification of any two of the four quantities α, β, a_0^2 √(2n) and K/μ fixes the other two.
Thus, for example, an α of 0.15 and a β of 0.20 may be obtained only with K/μ = 1.2 and a_0^2 √(2n) = 2.7.
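The statement that any two of the four quantities fix the other two can be made concrete by inverting (3.22) and (3.23). The sketch below is one way to do this under the approximations of those two equations; it uses the identity (1/2)[1 - Θ(x/√2)] = Q(x) for the normal tail probability, and the function names are assumptions introduced here.

```python
import math
from statistics import NormalDist


def q_inverse(prob):
    """Argument x at which the unit normal tail probability Q(x) equals prob."""
    return NormalDist().inv_cdf(1.0 - prob)


def parameters_from_errors(alpha, beta):
    """Return (K/mu, a0^2*sqrt(2n)) implied by (3.22) and (3.23).

    With d = a0^2*sqrt(n) and L = log(K/mu), the two equations read
    alpha = Q(d/2 + L/d) and beta = Q(d/2 - L/d).
    """
    z_alpha = q_inverse(alpha)
    z_beta = q_inverse(beta)
    d = z_alpha + z_beta
    log_ratio = d * (z_alpha - z_beta) / 2.0
    return math.exp(log_ratio), math.sqrt(2.0) * d


k_over_mu, x = parameters_from_errors(0.15, 0.20)
print(round(k_over_mu, 2), round(x, 2))
```

For α = 0.15 and β = 0.20 this gives K/μ close to 1.2 and a_0^2 √(2n) close to 2.66, in reasonable agreement with the values 1.2 and 2.7 read from Figure 3.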


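The normalization of the Bayes risk curves of Fig. 2 is accomplished by noting that

$$R(\mu, \delta^*) \to q C_{1-\alpha} + p C_{1-\beta} \quad \text{as } a_0 \to \infty,$$    (3.24)

and as a_0 → 0,

$$R(\mu, \delta^*) \to \begin{cases} q C_{1-\alpha} + p C_{\beta} & \text{when } K/\mu > 1, \\ q C_{\alpha} + p C_{1-\beta} & \text{when } K/\mu < 1. \end{cases}$$    (3.25)

Thus, with the help of (3.24), (3.25), and (3.8), the normalized Bayes risk curves are here defined by

$$R_N(\mu, \delta^*) = \frac{R(\mu, \delta^*) - (q C_{1-\alpha} + p C_{1-\beta})}{p\, (C_{\beta} - C_{1-\beta})} = \frac{K}{\mu}\, \alpha + \beta, \qquad (K/\mu > 1),$$    (3.26a)

$$R_N(\mu, \delta^*) = \frac{R(\mu, \delta^*) - (q C_{1-\alpha} + p C_{1-\beta})}{q\, (C_{\alpha} - C_{1-\alpha})} = \alpha + \frac{\mu}{K}\, \beta, \qquad (K/\mu < 1).$$    (3.26b)

Due to the symmetry of the pairs (3.22), (3.23) and (3.26a), (3.26b), the normalized risk for a given
a_0^2 √n is the same for K/μ and μ/K when these have the same value.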


The curves of Fig. 2 show that the minimum detectable signal, corresponding to a fixed fraction
of maximum risk, is smallest when the cost ratio K is equal to the ratio μ of prior probabilities.
Thus when μ is one, the Ideal Observer, who takes K = 1, minimizes the risk. The Neyman-Pearson
tests (K ≠ 1) for fixed μ and fixed risk yield smaller minimum detectable signals as the cost ratio is
increased when K < μ, and larger ones as it is increased when K > μ.


Note from Fig. 2 that when μ ≠ 1 certain Neyman-Pearson tests can result in a smaller mini-
mum detectable signal than the Ideal, if normalized risk is taken as the criterion, and when it makes
sense to compare the two types of observer. As an example, suppose μ = 1/4 and a_0^2 √(2n) = 2. An
Ideal Observer always takes K = 1, which in this case (from Figures 2 and 3) yields a normalized
Bayes risk of 0.79 with α = 0.045 and β = 0.61. On the other hand, a Neyman-Pearson Observer who
holds the false alarm probability α at 0.1 and minimizes β, needs K/μ = 2.3 and obtains a minimized
β of 0.45 (from Fig. 3). The latter's normalized Bayes risk is 0.68 from Fig. 2, and thus is
smaller than the Ideal Observer's. However, (assume C_{1-α} = C_{1-β} = 0, so that K = C_α/C_β) the Ideal
Observer with K = 1 weights Type I and Type II errors equally, while the Neyman-Pearson Observer
with K = 2.3/4 weights Type II errors more than Type I errors. The unnormalized risks here, given
by p C_β times the normalized risk, are 0.16 C_{βI} for the Ideal, and 0.14 C_{βNP} for the Neyman-Pearson.


The comparison thus depends on the relative costs assigned to the two types of errors, by the
two observers, i.e., if C_{βI} = C_{βNP} (so that C_{αI} = (4/2.3) C_{αNP}), the Neyman-Pearson risk is less than
the Ideal, but if C_{αI} = C_{αNP} (so that C_{βI} = (2.3/4) C_{βNP}) then the Ideal risk is less than the Neyman-
Pearson. The two observers may be compared on some other basis than risk, of course, such as the
probability of a correct decision (= 1 - qα - pβ) [3.3]. For the example above, this is 0.842 for the
Ideal and 0.830 for the Neyman-Pearson, so that the Ideal is better in this respect, as it must be by
definition, when percentage of correct decisions is chosen as the measure.
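The figures quoted in this example can be reproduced approximately from (3.22), (3.23) and (3.26). The sketch below is only a numerical check under those small-signal approximations; the bisection routine and all names are assumptions of the sketch, and the paper's values were read from Figures 2 and 3, so small differences in the last digit are to be expected.

```python
import math


def error_probabilities(x, k_over_mu):
    """alpha and beta from (3.22), (3.23), written in terms of x = a0^2*sqrt(2n)."""
    shift = math.log(k_over_mu) / x
    alpha = 0.5 * (1.0 - math.erf(x / 4.0 + shift))
    beta = 0.5 * (1.0 - math.erf(x / 4.0 - shift))
    return alpha, beta


def normalized_risk(alpha, beta, k_over_mu):
    """Normalized Bayes risk (3.26a) or (3.26b), depending on K/mu."""
    if k_over_mu > 1.0:
        return k_over_mu * alpha + beta
    return alpha + beta / k_over_mu


def neyman_pearson_ratio(x, alpha_target):
    """K/mu giving the required alpha, found by bisection on a logarithmic scale."""
    lo, hi = 1e-6, 1e6
    for _ in range(200):
        mid = math.sqrt(lo * hi)
        alpha, _ = error_probabilities(x, mid)
        if alpha > alpha_target:   # alpha decreases as K/mu increases
            lo = mid
        else:
            hi = mid
    return math.sqrt(lo * hi)


MU, X = 0.25, 2.0
p = MU / (1.0 + MU)
q = 1.0 - p

ideal_ratio = 1.0 / MU                       # Ideal Observer: K = 1
alpha_i, beta_i = error_probabilities(X, ideal_ratio)
risk_i = normalized_risk(alpha_i, beta_i, ideal_ratio)

np_ratio = neyman_pearson_ratio(X, 0.1)      # Neyman-Pearson Observer: alpha = 0.1
alpha_n, beta_n = error_probabilities(X, np_ratio)
risk_n = normalized_risk(alpha_n, beta_n, np_ratio)

correct_i = 1.0 - q * alpha_i - p * beta_i
correct_n = 1.0 - q * alpha_n - p * beta_n
print(alpha_i, beta_i, risk_i, correct_i)
print(np_ratio, beta_n, risk_n, correct_n)
```

The computed normalized risk is lower for the Neyman-Pearson Observer while the probability of a correct decision is higher for the Ideal Observer, which is the contrast drawn in the text.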


3.4. The Minimax Detection Rule 


The theory of Sec. 2.4 shows that the Minimax rule provides the likelihood ratio test corre- 
sponding to a certain least favorable distribution o(S). To find the latter we take advantage of the fact 
that a likelihood ratio test with the same conditional risk for both hypotheses is Minimax.


As an example, let us now apply the procedure in the simple alternative case, where p and q
(= 1-p) are unknown. We begin by varying p and q, calculating α and β for each likelihood ratio test
thus defined, until an α and β are found for which the two risks of (3.6) are the same, i.e.,

$$\alpha C_{\alpha} + (1 - \alpha)\, C_{1-\alpha} = \beta C_{\beta} + (1 - \beta)\, C_{1-\beta}.$$    (3.27)


The minimax p and q (i.e., p_M and q_M) for which the right combination of α and β (i.e., α_M and β_M) oc-
curs is then the least favorable a priori distribution.


No other test can give a smaller maximum conditional risk than the Minimax. Since the average
risk cannot exceed the maximum conditional risk, the Minimax rule also has a smaller maximum
average risk than any other (as p and q are varied). Of course, for some particular p = p' and q = q'
another test might have smaller risk than the Minimax, for the same p' and q'. On the other hand,
there would be a p and q for which it has a larger average risk. The Minimax rule thus has the ad- 
vantage that it guards against the worst case, and the disadvantage that in doing so it admits the pos- 
sibility of being bettered in particular cases. 


Figure 4 shows some Minimax calculations for the example discussed above, with C_{1-α} = C_{1-β} =
0 and C_α = C_β = 100.


Because of these cost assumptions the curve labeled Minimax coincides with the Bayes risk
curve for K/μ = μ = 1 (i.e., μ = 1 is the least favorable distribution here). The curves for other
values of μ are Bayes risk curves, where σ(S) is known a priori.


When a_0^2 √n is 1 and μ = 8, for example, we note that knowledge of μ reduces the risk by 20
units. Alternatively, if the minimum detectable signal is here defined as the smallest one for which
the risk does not exceed 10, we find that (a_0^2 √n)_min is 1.40 when μ is known and 2.55 when μ is un-
known. For fixed sample size n this means that the minimum detectable signal is increased by
10 log (2.55/1.40) = 2.61 db by lack of knowledge of μ. If the amplitude a_0 is fixed, on the other hand,
the integration time is increased by √(2.55/1.40) = 1.35, or by 35 per cent.
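The Minimax procedure of this section (vary p until the two conditional risks of (3.6) are equal) lends itself to a short numerical sketch. The code below assumes the error probabilities of (3.22), (3.23) written in terms of d = a_0^2 √n, zero costs for correct decisions, and C_α = C_β = 100 as in the example; the bisection search and all function names are assumptions made for the illustration.

```python
import math

C_ALPHA = 100.0   # cost of a Type I error (costs of correct decisions taken as zero)
C_BETA = 100.0    # cost of a Type II error


def q_function(x):
    """Unit normal tail probability, equal to (1/2)[1 - Theta(x/sqrt(2))]."""
    return 0.5 * math.erfc(x / math.sqrt(2.0))


def error_probabilities(d, p):
    """alpha and beta of the Bayes test designed for prior p, with d = a0^2*sqrt(n)."""
    mu = p / (1.0 - p)
    k = C_ALPHA / C_BETA
    shift = math.log(k / mu) / d
    return q_function(d / 2.0 + shift), q_function(d / 2.0 - shift)


def conditional_risks(d, p):
    """Conditional risks (3.6) for S = 0 and S != 0."""
    alpha, beta = error_probabilities(d, p)
    return alpha * C_ALPHA, beta * C_BETA


def average_risk(d, p):
    """Bayes risk when the prior p used in the design is the true one."""
    risk0, risk1 = conditional_risks(d, p)
    return (1.0 - p) * risk0 + p * risk1


def least_favorable_p(d):
    """Prior p at which the two conditional risks are equal, as in (3.27)."""
    lo, hi = 1e-9, 1.0 - 1e-9
    for _ in range(100):
        mid = 0.5 * (lo + hi)
        risk0, risk1 = conditional_risks(d, mid)
        if risk0 < risk1:   # the S = 0 risk grows with p, the S != 0 risk falls
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


D = 1.0
p_minimax = least_favorable_p(D)
minimax_risk = average_risk(D, p_minimax)
known_prior_risk = average_risk(D, 8.0 / 9.0)   # mu = p/q = 8
print(p_minimax, minimax_risk, known_prior_risk)
```

With equal error costs the least favorable prior comes out at p = q = 1/2 (μ = 1), and for d = 1 the difference between the Minimax risk and the Bayes risk for μ = 8 is close to the 20 units quoted above.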


3.5. System Comparison 


A likelihood-ratio receiver is in general a rather complex computer [1.13, 1.25]. It is important,
therefore, to be able to compare the performance of a compromise, nonoptimum system with
the optimum, so that the cost of the compromise may be a factor in system planning. Following
ref. [1.21], Sec. 2, we observe in binary detection that just as the optimum system computes the
likelihood ratio and compares the result with a certain threshold K, we may assume that the nonoptimum
system also computes some quantity F(V;S) for comparison with a threshold K, deciding
"signal plus noise" when F(V;S) > K and "noise alone" when F(V;S) < K. If the system function
F(V;S) is known, the probabilities α and β may be calculated from (3.20) and (3.21) with F(V;S) in
place of Λ, and with the same threshold K.


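As an example, we consider the coherent detection of a signal (of fixed amplitude) in Gaussian
noise with continuous sampling in (0,T). Middleton [1.13] shows that the likelihood ratio becomes for
this:

$$\log \Lambda = \log \mu + a_0\, \Phi(v, s) - \tfrac{1}{2}\, a_0^2\, \Phi(s, s),$$    (3.28)

with

$$\Phi(v, s) = \psi \int_0^T v(t)\, X(t)\, dt,$$    (3.29a)

$$s(t) = \int_0^T X(u)\, K(t - u)\, du, \qquad (0 \le t \le T).$$    (3.29b)

Here s and v are the normalized quantities: S = a_0 √ψ s; V = √ψ v, where ψ is the mean square noise
amplitude, and K(t-u) is the auto-correlation function of the noise. When (3.28) is used in (3.20) and
(3.21), the resulting α and β are exactly

$$\alpha = \frac{1}{2} \left\{ 1 - \Theta\left[ \frac{a_0\, \Phi_s^{1/2}}{2\sqrt{2}} + \frac{\log (K/\mu)}{\sqrt{2}\, a_0\, \Phi_s^{1/2}} \right] \right\},$$    (3.30a)

$$\beta = \frac{1}{2} \left\{ 1 - \Theta\left[ \frac{a_0\, \Phi_s^{1/2}}{2\sqrt{2}} - \frac{\log (K/\mu)}{\sqrt{2}\, a_0\, \Phi_s^{1/2}} \right] \right\},$$    (3.30b)

where

$$\Phi_s \equiv \Phi(s, s).$$    (3.30c)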


For the specific case of a sine wave of angular frequency ω_0, coherently detected with epoch ε_0 = 0,
in broad-band Gaussian noise, shaped by an RC filter [(RC)^{-1} = ω_F], we find that for a nonideal
system which treats the noise as having an infinitely wide spectrum, i.e., no coherence from sample
point to sample point, the minimum detectable signal (a_0^2)_min is 4.0 db higher than that obtained by
the limiting or ideal system when the correlation in the noise is taken into account [(a_0^2)_min is chosen
as a_0^2 for the 90% level of successful decisions; p = q = 1/2]. See Fig. 8 and Sec. 4 of reference
[1.13]. Here F(V;S) replaces Λ in (3.28), or equivalently, Φ(s,s), Eq. (3.30c), is replaced by the
corresponding expression Φ(s,s) in the nonideal situation.


4. Application to Extraction 


4.0. Introduction


In this section we consider extraction as the counterpart of parameter estimation. Point estimation
rather than estimation by confidence intervals is treated. Our main object here is to show
how some of the methods commonly used for optimum extraction appear from the risk point of view [4.1].


As before, we let γ denote the decision to be made about the signal S, and observe that when γ is
to be an estimate of S, the spaces Ω and Δ of Figure 1 have the same structure. We also assume that
each contains a continuum of points and is a finite closed region which may, however, be taken large
enough to be essentially infinite for practical purposes.


The cost function C(S,γ) to be used in the risk analysis is, of course, to be preassigned in ac-
cordance with the external constraints of the problem, and is critical in determining the nature of the
resulting system. A theorem of Hodges and Lehmann [4.2] says that


if Δ is the real line and C(S,γ) is a convex function* of γ
for every S, then for any decision rule δ there exists a non-
randomized decision rule whose risk is not greater than that
of δ for all S in Ω.


The squared-error cost function C(S,γ) = (S-γ)^2 is suitable for our purposes here, since it leads to
conventionally used extraction procedures, and is also convex, so that the inconvenience of consider-
ing both randomized and nonrandomized rules may be avoided, at least in the one-dimensional case.**


A further simplification results from the fact that a nonrandomized decision rule (e.g., one for which
δ is either 1 or 0) may be written as:

$$\delta(\gamma \mid V) = \delta\big(\gamma - \gamma_\sigma(V)\big),$$    (4.1)

where the δ on the right is now the Dirac delta function. Here it is essential to distinguish between
the estimate γ and the estimator γ_σ(V). The latter denotes the functional operation performed on the
data V by the system, while the former is simply a value of the output. The operation γ_σ(V) is thus
the quantity to be optimized.


4.1. Characterization of Bayes Extraction 


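To find the Bayes estimator, we begin with the conditional risk. From (2.5) and (4.1),

$$r(S, \gamma_\sigma) = \int_{\Gamma} \int_{\Delta} C(S, \gamma)\, F_S(V)\, \delta\big(\gamma - \gamma_\sigma(V)\big)\, dV\, d\gamma$$    (4.2)

$$= \int_{\Gamma} C\big(S, \gamma_\sigma(V)\big)\, F_S(V)\, dV.$$    (4.3)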


*A real-valued function φ(x) is convex in an interval (a,b) if for any x and y in (a,b), and any number
0 < λ < 1, λφ(x) + (1-λ)φ(y) ≥ φ[λx + (1-λ)y].

**When S and γ are multidimensional vectors, (S-γ)^2 is to be interpreted as the squared length of the
difference vector. Although the theorem above applies only to one-dimensional S and γ, it can usually be
extended to include multidimensional vectors [2.2]. We shall retain the vector notation in the follow-
ing for illustrative purposes, noting that this extension is necessary for the validity of results in the
multidimensional cases.


A useful alternative form, analogous to (3.6), is obtained by rearranging (4.2):

$$r(S, \gamma_\sigma) = \int_{\Delta} P_3(\gamma \mid S)\, C(S, \gamma)\, d\gamma,$$    (4.4)

where*

$$P_3(\gamma \mid S) = \int_{\Gamma} F_S(V)\, \delta\big(\gamma - \gamma_\sigma(V)\big)\, dV.$$    (4.5)

The latter is the probability (density) of making an estimate γ when the signal is S, and is thus an
error probability analogous to the error probabilities α, β in binary detection, cf. (3.7).


Next, the average risk for an a priori signal distribution σ(S) is expressed as:

$$R(\sigma, \gamma_\sigma) = \int_{\Omega} \int_{\Gamma} \big[ S - \gamma_\sigma(V) \big]^2\, \sigma(S)\, F_S(V)\, dV\, dS,$$    (4.6)

for the squared-error cost function. Since

$$\sigma(S)\, F_S(V) = f(V)\, P_1(S \mid V),$$    (4.7)

this may be rewritten

$$R(\sigma, \gamma_\sigma) = \int_{\Gamma} f(V)\, dV \int_{\Omega} \big[ S - \gamma_\sigma(V) \big]^2\, P_1(S \mid V)\, dS.$$    (4.8)

The γ_σ(V) for which this is smallest is the Bayes estimator, denoted by γ_σ*(V). The second integral
is a minimum for fixed V if γ_σ(V) is chosen as

$$\gamma_\sigma^*(V) = \int_{\Omega} S\, P_1(S \mid V)\, dS$$    (4.9)

$$= \frac{\int_{\Omega} S\, \sigma(S)\, F_S(V)\, dS}{\int_{\Omega} \sigma(S)\, F_S(V)\, dS}.$$    (4.10)

Thus, the Bayes estimator (for the cost function of (4.6)) is the conditional expectation of S given V.


The conditional risk (4.3) becomes the variance of the estimator when the cost function is
squared-error and γ_σ(V) is unbiased for every S, i.e., when

$$\int_{\Gamma} \gamma_\sigma(V)\, F_S(V)\, dV = S.$$    (4.11)

In this case the average risk (4.6) may be written as

$$R(\sigma, \gamma_\sigma) = \int_{\Omega} \mathrm{Var}_S\, \gamma_\sigma(V)\; \sigma(S)\, dS.$$    (4.12)

Thus, the Bayes estimator, which minimizes R(σ,γ_σ), is a minimum variance estimator for every S.
However, the Bayes estimator (4.10) is unbiased only for certain distributions.
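The conditional-expectation form (4.10) can be evaluated numerically for a simple case. The sketch below assumes a single scalar observation V = S + N with zero-mean Gaussian noise of unit variance and a zero-mean Gaussian prior of variance 4; these distributions, the integration grid, and the function names are assumptions of the sketch, not choices made in the paper.

```python
import math

PRIOR_VAR = 4.0   # variance of the assumed Gaussian prior sigma(S)
NOISE_VAR = 1.0   # variance of the assumed additive Gaussian noise
GRID = [-20.0 + 0.01 * i for i in range(4001)]   # values of S used for the sums


def gaussian(x, var):
    return math.exp(-0.5 * x * x / var) / math.sqrt(2.0 * math.pi * var)


def posterior_weights(v):
    """Unnormalized sigma(S) F_S(V) on the grid, with F_S(V) = W(V - S)."""
    return [gaussian(s, PRIOR_VAR) * gaussian(v - s, NOISE_VAR) for s in GRID]


def bayes_estimate(v):
    """Conditional expectation of S given V, i.e. (4.10), by a Riemann sum."""
    weights = posterior_weights(v)
    return sum(s * w for s, w in zip(GRID, weights)) / sum(weights)


def posterior_squared_error(v, estimate):
    """Inner integral of (4.8) for a trial value of the estimate."""
    weights = posterior_weights(v)
    return sum((s - estimate) ** 2 * w for s, w in zip(GRID, weights)) / sum(weights)


v_observed = 1.0
estimate = bayes_estimate(v_observed)
print(estimate)
print(posterior_squared_error(v_observed, estimate),
      posterior_squared_error(v_observed, v_observed))
```

For these assumed distributions the estimate is 0.8 V, so its mean for a fixed S is 0.8 S rather than S. This is a case in which the Bayes estimator is biased, in line with the closing remark above, even though it gives a smaller posterior squared error than the unbiased choice γ = V.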


When signal and noise belong to ergodic processes, the average risk (4.6) may be written as

$$R(\sigma, \gamma_\sigma) = \lim_{T \to \infty} \frac{1}{T} \int_0^T \big\{ S(t) - \gamma_\sigma[V(t)] \big\}^2\, dt.$$    (4.13)


*Note that P_1(S|V), P_2(S|γ) and P_3(γ|S) are all different functions.


The conventional treatment of extraction uses the minimum value of this as an optimum criterion, 
just as the risk formulation does when the cost is squared-error. Usually, γ_σ is required to be the
output of a linear, physically realizable filter with V(t) at its input. Since (4.10) is not generally 
linear in V, optimum extractors under this constraint are not necessarily Bayes. We observe, how- 
ever, that some of the ideas of risk theory are useful for such restricted classes of decision rules, 
even though the main theorems do not apply. That is, if we agree that only the class of linear esti- 
mators is to be considered, we may speak of the one with smallest average risk, the one for which 
the maximum conditional risk is smallest, the one with the property that no other is uniformly better, 
etc., settling the questions of existence and uniqueness in specific cases by construction. 


Further detailed properties of Bayes and Minimax extractors, with results for reception prob- 
lems of practical interest, are reserved for later presentation. 


4.2. The Maximum Likelihood Estimator 


A most useful method of obtaining estimates is furnished by the principle of maximum likelihood,
which takes the S (= S_ML) that maximizes the likelihood function (i.e., F_S(V) regarded as a function of
S for fixed (i.e., given) V) as the best estimate of the actual S present at the input.


Now, it is well known that when a sufficient statistic exists, the maximum likelihood estimator
depends on it alone [4.3]. Its further significance is easily seen from (4.5). If (4.5) is written for
γ = S, we have

$$P_3(\gamma = S \mid S) = \int_{\Gamma} F_S(V)\, \delta\big(S - \gamma_\sigma(V)\big)\, dV,$$    (4.14)

which is therefore the probability of a correct decision when the signal is S. It is clearly largest if
for each V we choose γ_σ(V) equal to S_ML. The conditional risk (4.4) is the sum of the products of
the various error probabilities and their costs. Since the cost of a correct decision is always less
than any other (by definition), the maximum likelihood estimator, by assigning the largest probabili-
ties to the smallest costs, minimizes the risk if certain symmetries are present in F_S(V) as a function
of S and in the cost function C(S,γ). Wald [4.4] shows that the maximum likelihood estimator of a one-
dimensional parameter S minimizes the Bayes risk when the cost depends only on the difference
|γ - S|, when the a priori distribution σ(S) is a constant, and when F_S(V) = W(V-S) is symmetric about
S_ML.


By taking the average of each side of (4.14) with respect to ana priori signal distribution o(S), 
we obtain the average probability of a correct estimate as 


[ls o(S) dS = fey {(V) [ P(s|¥) &(S-y,(¥)) ds. (4.15) 
2 i 2 


The same reasoning that led to the maximization of (4.14) shows here that this is largest when for
each V, $\gamma_\sigma(V)$ is chosen as the particular value of S that makes the posterior probability $P_1(S|V)$
largest. The Woodward and Davies receiver [1.19] presents $P_1(S|V)$ as a function of S at its output.
We see, therefore, that if this is made into a "decision" system by taking the maximum value of the
output as the estimate, the average probability of a correct decision is maximized. However, this
procedure might be criticized because it ignores the possible importance of errors, i.e., failures of γ to
equal S in one or more of its components, whose measure in the risk formulation is the cost function
$C(S,\gamma)$.
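
The difference between choosing the S that maximizes the likelihood and the S that maximizes the posterior probability can be illustrated with a minimal sketch. It assumes a discrete set of candidate signals, a scalar observation, unit-variance Gaussian noise, and a made-up prior; the function names and all numerical values are hypothetical and are not taken from the paper.

```python
import math

def gaussian_likelihood(v, s, sigma=1.0):
    """F_S(V) for a scalar observation v = s + noise (assumed Gaussian noise)."""
    return math.exp(-0.5 * ((v - s) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def posterior(prior, likelihoods):
    """P_1(S|V) over a discrete signal set, from the prior and F_S(V)."""
    joint = [p * l for p, l in zip(prior, likelihoods)]
    total = sum(joint)
    return [j / total for j in joint]

def map_estimate(signals, prior, likelihoods):
    """Return the signal with the largest posterior probability, and the posterior."""
    post = posterior(prior, likelihoods)
    best = max(range(len(signals)), key=lambda i: post[i])
    return signals[best], post

signals = [0.0, 1.0, 2.0]      # hypothetical candidate signal values
prior = [0.6, 0.3, 0.1]        # hypothetical a priori probabilities
v = 1.0                        # hypothetical observation

likes = [gaussian_likelihood(v, s) for s in signals]
s_ml = signals[max(range(len(signals)), key=lambda i: likes[i])]
s_map, post = map_estimate(signals, prior, likes)
print("maximum likelihood estimate:", s_ml)
print("maximum posterior estimate: ", s_map)
```

With a uniform prior the two estimates coincide, which is the situation in which the maximum likelihood estimator also maximizes the average probability of a correct decision.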

The problem of estimating the amplitude of a small signal in additive noise furnishes an interesting
application of the maximum likelihood method.


Assume that the form of the signal is known, that only one signal is possible, and that V and S
are normalized: $V = \sqrt{\psi}\,v$; $S = \sqrt{\psi}\,a_0 s$, with $\psi$ the mean square noise amplitude. Expansion of
$\log W(v - a_0 s)$ in powers of $a_0$, according to the procedure of Middleton [1.13, 1.21], yields the following,
if we regard the terms in $a_0$ and $a_0^2$ as an adequate representation of the distribution in the
threshold case:


$$\log W_n(V_1 - S_1, \ldots, V_n - S_n) \simeq B^{(0)} + \tilde{v}B^{(1)} + \tilde{v}B^{(2)}v - a_0\left[\tilde{s}B^{(1)} + 2\tilde{v}B^{(2)}s\right] + a_0^2\,\tilde{s}B^{(2)}s. \tag{4.16}$$

Here $B^{(0)}$ is a scalar, $B^{(1)}$ a vector and $B^{(2)}$ a square matrix, and all depend in general on the means,
variances and all higher moments of the noise distribution. The maximum likelihood estimate for this
representation of the likelihood function is easily shown to be

$$a_{0ML} \simeq \frac{2\tilde{v}B^{(2)}s + \tilde{s}B^{(1)}}{2\tilde{s}B^{(2)}s} \tag{4.17}$$


in this approximation. Substitution of this back into (4.16) gives

$$W(V - S) \simeq \left\{\exp\left[-2a_{0ML}\,a_0\,\tilde{s}B^{(2)}s + a_0^2\,\tilde{s}B^{(2)}s\right]\right\}\left\{\exp\left[B^{(0)} + \tilde{v}B^{(1)} + \tilde{v}B^{(2)}v\right]\right\}. \tag{4.18}$$

Since the distribution factors into one term involving the statistic $a_{0ML}$ and the parameter $a_0$, and
another independent of the parameter, the estimator $a_{0ML}$ is a sufficient statistic [4.3; cf. Sec. 2.5].
When the noise has a Gaussian structure

$$B^{(0)} = -\tfrac{1}{2}\log\left[(2\pi)^n \det K\right], \qquad B^{(1)} = 0, \qquad B^{(2)} = -\tfrac{1}{2}K^{-1}, \tag{4.19}$$

where K is the variance matrix and the result (4.17), (4.18) is exact.
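
For Gaussian noise the amplitude estimate takes the familiar correlation form $\tilde{v}K^{-1}s / \tilde{s}K^{-1}s$. The sketch below is a minimal illustration under the additional assumption of white noise (K equal to the identity matrix), so that no matrix inversion is needed; the signal shape, amplitude, and noise samples are hypothetical values chosen for the example.

```python
def ml_amplitude(v, s):
    """Maximum likelihood amplitude for white Gaussian noise (assumed K = I):
    a_ML = (v . s) / (s . s)."""
    num = sum(vi * si for vi, si in zip(v, s))
    den = sum(si * si for si in s)
    return num / den

s = [1.0, -1.0, 1.0, 1.0]          # known (normalized) signal shape, hypothetical
a_true = 0.5                       # amplitude to be estimated, hypothetical
noise = [0.1, -0.2, 0.05, 0.0]     # one fixed noise realization, hypothetical

v = [a_true * si + ni for si, ni in zip(s, noise)]
a_hat = ml_amplitude(v, s)
print("estimated amplitude:", a_hat)
```

In the absence of noise the estimate returns the true amplitude exactly; with noise, the error is the projection of the noise onto the signal shape, divided by the signal energy.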


4.3. System Evaluation and Comparison; Analogies between Detection and Extraction 


Here, as for detection, systems may be evaluated and compared in risk terms. When a system
performs both detection and extraction, one may consider the corresponding two risks as components
of the total risk associated with "complete" reception of the signal. Thus, the extraction risk of any
given system may be calculated from (4.2) with the system function in place of $\gamma_\sigma(V)$.


Figure 5 briefly recapitulates the risk formulations for detection and extraction. For each, the
conditional risk is the sum of the various error probabilities for a given S, $P_3(\gamma|S)$, weighted according
to error cost $C(S,\gamma)$. The average risk is the expected value of this in view of the prior signal
distribution $\sigma(S)$. The essential difference between the two appears in the calculation of error
probabilities. For extraction this amounts to a simple "folding" of the distribution $F(V|S)$, i.e., a
selection of all the V's that lead to a given decision γ and a summation of their probabilities of
occurrence. For detection there is seen to be a similar folding operation, with the likelihood ratio
$\Lambda(V)$ taking the place of the estimator $\gamma_\sigma(V)$, followed by an additional summation over all values
of $\Lambda(V)$ above or below the threshold K. Thus detection essentially involves an extraction type of
operation which maps observation space Γ onto the real line (domain of y) with subsequent division
of this domain into two parts at the threshold value K. (See remarks in Sec. 1 on this connection
between detection and extraction.)
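
The two-stage view of detection described above (map the observation to a number, then split the real line at a threshold) can be sketched as follows. This is only an illustration: it assumes a known signal in unit-variance white Gaussian noise and takes the likelihood ratio in the form $pW(V-S)/qW(V)$; the function names and numerical values are hypothetical.

```python
import math

def likelihood_ratio(v, s, p=0.5):
    """Lambda(V) = p W(V - S) / (q W(V)), assuming unit-variance white Gaussian noise."""
    q = 1.0 - p
    # log[W(v - s) / W(v)] = sum(v_i s_i - s_i^2 / 2) for unit-variance Gaussian noise
    log_ratio = sum(vi * si - 0.5 * si * si for vi, si in zip(v, s))
    return (p / q) * math.exp(log_ratio)

def decide(v, s, threshold=1.0, p=0.5):
    """Second stage: divide the real line at the threshold K.
    Returns 1 for "signal and noise", 0 for "noise alone"."""
    return 1 if likelihood_ratio(v, s, p) >= threshold else 0

s = [1.0, 1.0]                    # hypothetical known signal
print(decide([1.0, 1.0], s))      # observation equal to the signal
print(decide([0.0, 0.0], s))      # observation with no signal component
```

Raising the threshold K trades a lower false alarm probability for a higher probability of a missed signal, without changing the first (mapping) stage.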


5. Connection Between Information Loss and Risk 


5.0. Introduction 


In this section we show how, as a special case of decision theory, the Shannon measure of in- 
formation loss may be used as a criterion of performance for detection and extraction systems. Sys- 
tems that minimize information loss are described, and some of the relations between the minimum 
information loss criterion and the minimum risk criterion are pointed out. 


As mentioned in Sec. 1.1, the general formulations from the points of view of information loss and
risk are the same except that the cost function $C(S,\gamma)$ of the latter is replaced by the uncertainty
$-\log P_2(S|\gamma)$. The average information loss, or equivocation, is thus given by

$$H(\sigma,\delta) = -\int_\Omega\int_\Gamma\int_\Delta \log P_2(S|\gamma)\;\sigma(S)\,F_S(V)\,\delta(\gamma|V)\;dS\,dV\,d\gamma, \tag{5.1}$$

where again $\delta(\gamma|V)$ is the decision rule, assumed to be nonrandom. Since the value of $P_2(S|\gamma)$ for
given S and γ depends on the decision rule in use, while that of $C(S,\gamma)$ does not, the decision rule that
minimizes information loss is harder to find than the one that minimizes risk.




5.1. The Information Loss Criterion for Detection. 


To specialize the expression (5.1) for (binary) detection, we assume first that the (nonrandomized)
decision rule divides observation space Γ into two regions, here denoted as $\Gamma_0$ and $\Gamma_1$, and that
the decision $\gamma_0$ (noise alone) is made when $V \in \Gamma_0$ (∈ = "lies in" or "belongs to"), and the decision $\gamma_1$
(signal and noise) when $V \in \Gamma_1$. Thus we write

$$\delta(\gamma_i \mid V \in \Gamma_j) = \delta_{ij}, \qquad i,j = 0,1, \tag{5.2}$$

where $\delta_{ij}$ = 0 (i ≠ j) or 1 (i = j). With this, (5.1) becomes

$$H(\sigma,\delta) = -\int_\Omega \sigma(S)\left\{P_3(\gamma_0|S)\log P_2(S|\gamma_0) + P_3(\gamma_1|S)\log P_2(S|\gamma_1)\right\}dS. \tag{5.3}$$


The equivocation for any given binary detection system may be calculated from (5.3) and used
to judge its performance [5.1]. As an example, let us suppose for the moment that either $S = S_0 = 0$
or $S = S_1 \neq 0$ can be present at the input with a priori probabilities $\sigma(S_0) = q$ and $\sigma(S_1) = p$ (simple
alternative case). Then $\sigma(S)$ equals $\sigma(S_0)\,\delta(S - S_0) + \sigma(S_1)\,\delta(S - S_1)$, and (5.3) becomes

$$H(\sigma,\delta) = -\sum_{i,j} \sigma(S_i)\,P_3(\gamma_j|S_i)\,\log P_2(S_i|\gamma_j), \qquad i,j = 0,1. \tag{5.4}$$
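$P_2$ and $P_3$ may readily be expressed in terms of the error probabilities α and β as follows:

$$P_3(\gamma_0|S_0) = 1 - \alpha, \quad P_3(\gamma_0|S_1) = \beta, \quad P_3(\gamma_1|S_0) = \alpha, \quad P_3(\gamma_1|S_1) = 1 - \beta; \tag{5.5}$$

$$P_2(S_0|\gamma_0) = \frac{q(1-\alpha)}{q(1-\alpha) + p\beta}, \qquad P_2(S_1|\gamma_0) = \frac{p\beta}{q(1-\alpha) + p\beta},$$

$$P_2(S_0|\gamma_1) = \frac{q\alpha}{q\alpha + p(1-\beta)}, \qquad P_2(S_1|\gamma_1) = \frac{p(1-\beta)}{q\alpha + p(1-\beta)}. \tag{5.6}$$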
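
As a minimal sketch of how the equivocation of a binary detection system might be evaluated numerically from the two error probabilities and the prior, the function below forms the joint probabilities $\sigma(S_i)P_3(\gamma_j|S_i)$, divides by the total probability of each decision to obtain $P_2(S_i|\gamma_j)$, and sums the terms of (5.4). Base-2 logarithms are assumed so that the result is in bits; the function name and the test values are hypothetical.

```python
import math

def equivocation_bits(alpha, beta, p):
    """Equivocation (bits) of a binary detector with false alarm probability alpha,
    false dismissal probability beta, and a priori signal probability p (q = 1 - p)."""
    q = 1.0 - p
    prior = [q, p]                       # sigma(S_0), sigma(S_1)
    # p3[j][i] = P_3(gamma_j | S_i)
    p3 = [[1.0 - alpha, beta],
          [alpha, 1.0 - beta]]
    h = 0.0
    for j in range(2):
        # total probability of making decision gamma_j
        p_gamma = sum(prior[i] * p3[j][i] for i in range(2))
        for i in range(2):
            joint = prior[i] * p3[j][i]
            if joint > 0.0:              # terms with zero probability contribute nothing
                h -= joint * math.log2(joint / p_gamma)
    return h

print(equivocation_bits(0.5, 0.5, 0.5))   # decisions carry no information about S
print(equivocation_bits(0.1, 0.1, 0.5))
print(equivocation_bits(0.0, 0.0, 0.5))   # error-free decisions
```

The first and last calls bracket the behavior discussed for very small and very large signals: the equivocation ranges between the prior uncertainty (1 bit for p = q = 1/2) and zero.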


Figure 6 shows how the equivocation of Bayes tests depends on the cost ratio K and $a_0^2\sqrt{n}$ for
$\mu = 1$ (and Rayleigh statistics). These curves are calculated with the help of Figure 3 to find α and
β, and by evaluating (5.5) and (5.6), followed by substitution into (5.4). In these circumstances we see
that the Ideal Observer (K = 1) loses the least information for a given signal amplitude and integration
time. As the cost ratio K is varied for fixed $a_0^2\sqrt{n}$, the minimum at K = 1 appears broad; in fact, the
information loss for any fixed $a_0^2\sqrt{n}$ does not vary by more than 0.2 bit as K is changed from 1/16 to
16. These curves may be used to define a minimum detectable signal (for threshold detection) in the
same way as the Bayes risk curves are used. Accordingly, if 0.2 bit is taken as the largest allowable
loss, the minimum value of $a_0^2\sqrt{n}$ is 3.75 for K = 1 and 4.25 for K = 16, 1/16. For a fixed sample size
n this amounts to a difference of only 0.54 db in the amplitude of the minimum detectable signal.
Correspondingly, for fixed signal amplitudes, the change in integration time is only $\sqrt{4.25/3.75} = 1.06$, or
6 per cent. For very small signals the equivocation approaches 1 bit, so that the system does no
better than one who guesses on the basis of the a priori probabilities. For large signals or integration
times, on the other hand, the equivocation approaches zero, corresponding to the increasing
certainty of a correct decision.
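
The two numerical figures quoted above can be checked with a few lines of arithmetic. The sketch simply evaluates the expressions as the text states them (an amplitude ratio equal to the square root of 4.25/3.75, expressed in decibels, and the same square root as the integration-time factor); the variable names are hypothetical.

```python
import math

ratio = 4.25 / 3.75                      # ratio of the two minimum values read from the curves

# At fixed n the amplitude a_0 scales as the square root of this ratio.
amplitude_ratio = math.sqrt(ratio)
amplitude_db = 20.0 * math.log10(amplitude_ratio)

# Integration-time factor as quoted in the text.
time_factor = math.sqrt(ratio)

print("amplitude difference (db):", round(amplitude_db, 2))
print("integration time factor:  ", round(time_factor, 2))
```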


Let us now return to the general expression for equivocation, (5.3), and seek to minimize it by
choice of the decision rule, i.e., here by choice of the boundary between regions $\Gamma_0$ and $\Gamma_1$, without
specific assumptions as yet about the prior signal distribution $\sigma(S)$. The probability functions $P_2$ and
$P_3$ may be written

$$P_2(S|\gamma_i) = \int_{\Gamma_i} P_1(S|V)\,P_4(V|\gamma_i)\,dV, \tag{5.7}$$

$$P_3(\gamma_i|S) = \int_{\Gamma_i} F_S(V)\,dV. \tag{5.8}$$


Here $P_4(V|\gamma_i)$ is the probability that V was responsible for the decision $\gamma_i$. Since the decision rule is
nonrandomized, every V in $\Gamma_i$ leads to the decision $\gamma_i$ with probability 1, and the V's outside of $\Gamma_i$
cannot lead to $\gamma_i$ at all. Thus $P_4(V|\gamma_i)$ is constant throughout $\Gamma_i$, equal in fact to the reciprocal of the
"volume" of $\Gamma_i$, since $P_4$ must be properly normalized. Thus,

$$P_4(V|\gamma_i) = 1/\Gamma_i, \tag{5.9}$$

$$P_2(S|\gamma_i) = \frac{1}{\Gamma_i}\int_{\Gamma_i} P_1(S|V)\,dV, \tag{5.10}$$

where we have let $\Gamma_i$ stand here for the volume of the region as well as for the domain of V included
within the volume. Now denoting by V' a point on the boundary between $\Gamma_0$ and $\Gamma_1$, and letting $\Gamma_0$ be
increased to $\Gamma_0 + d\Gamma_0$ by change of V' to V' + dV', we find for the derivatives involved in the
minimization:

$$\frac{\partial}{\partial\Gamma_0}P_3(\gamma_0|S) = -\frac{\partial}{\partial\Gamma_0}P_3(\gamma_1|S) = F_S(V'), \tag{5.11a}$$

$$\frac{\partial}{\partial\Gamma_0}\log P_2(S|\gamma_0) = \frac{1}{\Gamma_0}\left[\frac{P_1(S|V')}{P_2(S|\gamma_0)} - 1\right], \tag{5.11b}$$

$$\frac{\partial}{\partial\Gamma_0}\log P_2(S|\gamma_1) = -\frac{1}{\Gamma_1}\left[\frac{P_1(S|V')}{P_2(S|\gamma_1)} - 1\right]. \tag{5.11c}$$
The derivative of the bracket in the integrand of (5.3) then becomes

$$\frac{P_1(S|V')}{\sigma(S)}\left[\frac{P_5(\gamma_0)}{\Gamma_0} - \frac{P_5(\gamma_1)}{\Gamma_1}\right] - \frac{P_3(\gamma_0|S)}{\Gamma_0} + \frac{P_3(\gamma_1|S)}{\Gamma_1} + F_S(V')\log\frac{P_2(S|\gamma_0)}{P_2(S|\gamma_1)}, \tag{5.12}$$

where we have used the relation

$$P_2(S|\gamma_i) = \frac{\sigma(S)}{P_5(\gamma_i)}\,P_3(\gamma_i|S). \tag{5.13}$$

Here $P_5(\gamma_i)$ is the (total) probability of making decision $\gamma_i$, given by

$$P_5(\gamma_i) = \int_\Omega P_3(\gamma_i|S)\,\sigma(S)\,dS. \tag{5.14}$$

The integration over Ω indicated in (5.3) causes the first four terms in (5.12) to cancel, leaving finally

$$\int_\Omega \sigma(S)\,F_S(V')\,\log\frac{P_2(S|\gamma_0)}{P_2(S|\gamma_1)}\,dS = 0. \tag{5.15}$$

This is the condition for minimum (or maximum) equivocation, i.e., the boundary between $\Gamma_0$ and $\Gamma_1$
must be such that this relation is satisfied.


The results for the one-sided and simple alternative tests are obtained from (5.15) by suitable
specialization of $\sigma(S)$. For the one-sided alternative we take as before $\sigma(S) = q\,\delta(S - 0) + p\,w(S)$, which
yields (with $F_S(V)$ now replaced by $W(V - S)$ in the case of additive noise):

$$\frac{p\,\langle W(V' - S)\rangle}{q\,W(V')} = \log\frac{P_2(0|\gamma_1)}{P_2(0|\gamma_0)}, \tag{5.16}$$

where

$$\langle W(V' - S)\rangle = \int_\Omega W(V' - S)\,w(S)\,\log\frac{P_2(S|\gamma_0)}{P_2(S|\gamma_1)}\,dS. \tag{5.17}$$

Equations (5.16) and (5.17) show that the optimum (or extremal) division of observation space is
achieved here by a generalized likelihood ratio test in which $W(V' - S)$ is averaged with respect to a
distribution $w(S)\log[P_2(S|\gamma_0)/P_2(S|\gamma_1)]$, which itself depends on the optimum (or extremal) division.




For the simple alternative case we have $\sigma(S) = q\,\delta(S - 0) + p\,\delta(S - S_1)$, so that (5.15) becomes*

$$\frac{p\,W(V' - S_1)}{q\,W(V')} = K_H, \tag{5.18}$$

where

$$K_H = \frac{\log z_0}{\log z_1} \tag{5.19}$$

and

$$z_0 = \frac{P_2(S = 0|\gamma_0)}{P_2(S = 0|\gamma_1)}, \tag{5.20a}$$

$$z_1 = \frac{P_2(S_1|\gamma_1)}{P_2(S_1|\gamma_0)}. \tag{5.20b}$$

The values of V' satisfying equation (5.18) define the extremum boundary between $\Gamma_0$ and $\Gamma_1$, so that if
"noise alone" is decided whenever V falls within $\Gamma_0$ and "signal and noise" when V falls within $\Gamma_1$, the
information loss is a maximum or a minimum for this division.


Note that Equation (5.18) defines a likelihood ratio test of the same type as the Bayes test for
the corresponding problem in the risk formulation. Thus tests that minimize information loss (when
they exist) belong to the Bayes class and are equivalent to minimum average risk tests with special
cost assumptions. It is therefore possible for a system to be optimum simultaneously from the
standpoints of both risk and information loss.


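To show that there are solutions of (5.18) that minimize information loss, we may use (5.6) to
express $z_0$ and $z_1$ of (5.20) in terms of p, q, α and β. With $\mu = p/q$ the result is

$$z_0 = \left(\frac{1-\alpha}{\alpha}\right)\left(\frac{\alpha + \mu(1-\beta)}{1 - \alpha + \mu\beta}\right), \tag{5.21a}$$

$$z_1 = \left(\frac{1-\beta}{\beta}\right)\left(\frac{1 - \alpha + \mu\beta}{\alpha + \mu(1-\beta)}\right). \tag{5.21b}$$

It may readily be shown that for α + β < 1 (the case of ordinary interest), $z_0$ and $z_1$ are both greater
than unity, so that the likelihood threshold $K_H$ is always positive. [We note that interchange of α and
β and of p and q interchanges $z_0$ and $z_1$, thus inverting $K_H$.]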
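
The threshold of (5.18) can be evaluated numerically once α, β and the prior ratio are given. The sketch below is an illustration only: it substitutes the posterior probabilities of (5.6) into the ratios of (5.20) and takes the ratio of logarithms; the function names and the sample values are hypothetical.

```python
import math

def z_values(alpha, beta, mu):
    """z_0 and z_1 of (5.20), written in terms of alpha, beta and mu = p/q via (5.6)."""
    a = alpha + mu * (1.0 - beta)
    b = 1.0 - alpha + mu * beta
    z0 = (1.0 - alpha) / alpha * a / b
    z1 = (1.0 - beta) / beta * b / a
    return z0, z1

def info_threshold(alpha, beta, mu):
    """Threshold on the right-hand side of (5.18): log z_0 / log z_1."""
    z0, z1 = z_values(alpha, beta, mu)
    return math.log(z0) / math.log(z1)

# Symmetric errors with equal priors give a threshold of unity.
print(info_threshold(0.1, 0.1, 1.0))

# Interchanging alpha with beta and p with q inverts the threshold.
k = info_threshold(0.2, 0.3, 2.0)
k_swapped = info_threshold(0.3, 0.2, 0.5)
print(k, k_swapped, k * k_swapped)
```

For α + β < 1 both ratios exceed unity, so both logarithms are positive and the threshold is positive, in line with the remark above.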


The universal curve of Figure 7 shows the relation between α, β, and $K_H$ for μ = 1, exhibiting
this symmetry. We see that the existence of a minimum information loss test depends on whether the
statistics of the problem admit a likelihood ratio test with α and β related to the threshold $K_H$, as
shown in Figure 7. If, for example, Figure 7 is superimposed on Figure 3, which gives the
characteristics for Rayleigh statistics, we observe that there are no combinations of α, β and K (for μ = 1)
that fit both at once, except for those along the line α = β, K = $K_H$ = 1. Figure 6 shows that the
information loss is indeed a minimum for this value of K, so that for Rayleigh statistics the Ideal
Observer (who takes K = 1) minimizes information loss and risk simultaneously (when μ = 1).


On the other hand, we note that since the Neyman-Pearson observer does not generally minimize
information loss, as the example of Figure 6 shows, and yet is optimum in a risk sense when fixed
false alarm time is important, we may conclude that it is not always desirable for a system
to minimize information loss. Detailed control of decision error may be more important in
some cases.


Thus, although tests that minimize information loss are likelihood ratio tests, and therefore
form a subclass of the Bayes tests, they exist under much less general conditions than do the
minimum risk tests. That is, the latter exist for any given cost ratio and μ, while the former exist only
for certain cost ratios and μ's, depending on the statistics. Determination of the broad conditions
under which the information loss extremum exists and, moreover, is a minimum, awaits further
investigation.


*W.M. Siebert and R.M. Lerner (M.I.T.) in a recent unpublished memorandum have independently
obtained a similar result by a different method.


5.2. The Information Loss Criterion for Extraction. 


To specialize (5.1) for extraction we first assume that the decision rule is nonrandomized and
use (4.1) to obtain

$$H(\sigma,\delta) = -\int_\Gamma dV\,f(V)\int_\Omega P_1(S|V)\,\log P_2[S|\gamma(V)]\,dS. \tag{5.22}$$

This expression may be minimized by choosing γ to minimize the second integral for arbitrary fixed
V. As before, $P_2$ may be expressed as

$$P_2(S|\gamma) = \int_{\Gamma_\gamma} P_1(S|V)\,P_4(V|\gamma)\,dV, \tag{5.23}$$

where $\Gamma_\gamma$ denotes the domain of all V's that lead to the decision γ. By the argument used previously
(cf. (5.9)), $P_4$ is constant over the region $\Gamma_\gamma$ and zero outside, so that

$$P_4(V|\gamma) = \frac{1}{N(\gamma)}, \qquad \text{where } N(\gamma) = \int_{\Gamma_\gamma} dV. \tag{5.24}$$

Thus, (5.23) becomes

$$P_2(S|\gamma) = \frac{M_S(\gamma)}{N(\gamma)}, \qquad \text{where } M_S(\gamma) = \int_{\Gamma_\gamma} P_1(S|V)\,dV. \tag{5.25}$$
Y 


Differentiating (5.22) with respect to γ we obtain the following condition for an information loss
extremum:

$$\int_\Omega P_1(S|V)\left[\frac{M_S'(\gamma)}{M_S(\gamma)} - \frac{N'(\gamma)}{N(\gamma)}\right]dS = 0, \tag{5.26}$$

where the primes denote differentiation with respect to γ. In view of the definitions of (5.24) and
(5.25), this states the requirement to be fulfilled by $\Gamma_\gamma$. The optimum (or extremal) rule for obtaining
γ from V must be such that it produces $\Gamma_\gamma$'s with the properties implied by (5.26).


We note immediately that (5.26) is satisfied if γ is a sufficient statistic, i.e., if $P_1(S|V) =
P_2(S|\gamma)$ for all V in $\Gamma_\gamma$ (see Sec. 2.5). For in that case we have

$$M_S(\gamma) = P_2(S|\gamma)\,N(\gamma) \tag{5.27}$$

and the bracket in (5.26) becomes

$$\frac{1}{P_2(S|\gamma)}\,\frac{\partial P_2(S|\gamma)}{\partial\gamma}, \tag{5.28}$$

which satisfies (5.26) identically. The sufficiency condition results alternatively if (5.22) is
minimized directly with respect to unconstrained variation of the function $P_2(S|\gamma)$. When the distribution
$F_S(V)$ does not admit a sufficient statistic, however, (5.26) gives the condition for an extremum.
Specific tests that fulfill this condition remain to be investigated.


6. Summary Remarks 


This paper has attempted to demonstrate some of the advantages of regarding reception systems
in communication as systems for making statistical decisions with minimum risk or information loss.
This approach, closely related to game theory, assumes a loss function is assigned at the outset to
each possible decision error, in some cases depending as well on the choice of decision rule. This
emphasizes the obvious but often overlooked fact that the criterion of best performance is not absolute
but arbitrary, its merit depending on how well the loss function chosen reflects the constraints on the
problem and the design objectives.


Formulation of the detection and extraction problems within this framework exhibits their close
relationship. Detailed analysis reveals the nature of optimum detection and extraction systems from
each point of view, showing how previously used criteria appear as special cases of the more general
formulation. Methods for comparing actual and ideal systems are also briefly outlined, whereby
performance sacrificed by compromise in design may, at least in principle, be found. The central role
of a priori statistical knowledge in practical system design is stressed throughout, and Minimax
methods for handling incomplete knowledge of this type are discussed and illustrated.


The new feature of this work is the adaptation of recent advances in the theory of statistical
inference to practical communication problems. The general formulation appears broad enough to
form the basis of attack on many special problems of interest.


Bibliography

[1.1] N. Wiener, "The Extrapolation, Interpolation, and Smoothing of Stationary Time Series,"
John Wiley (New York) 1949.

[1.2] H.E. Singleton, Theory of Nonlinear Transducers, Tech. Report No. 160, Res. Lab.
Electronics (M.I.T.), August (1950).

[1.3] L.A. Zadeh and J.R. Ragazzini, An Extension of Wiener's Theory of Prediction, J. Appl.
Phys. 21, 645 (1950).

[1.4] R.C. Booton, Jr., An Optimization Theory for Time-varying Linear Systems with
Nonstationary Statistical Inputs, Proc. I.R.E. 40, 977 (1952).

[1.5] R.C. Davis, On the Theory of Prediction of Nonstationary Stochastic Processes, J. Appl. Phys.
23, 1047 (1952).

[1.6] D.O. North, Analysis of Factors which Determine Signal-to-Noise Discrimination in Pulsed
Carrier Systems, R.C.A. Report PTR-6C (June, 1943).

[1.7] J.H. Van Vleck and D. Middleton, A Theoretical Comparison of the Visual, Aural, and Meter
Reception of Pulsed Signals in the Presence of Noise, J. Appl. Phys. 17, 940 (1946).

[1.8] H. Den Hartog and F.A. Muller, Optimum Instrument Response for Discrimination against
Spontaneous Fluctuations, Physica 13, 571 (1947).

[1.9] T.S. George, Fluctuations of Ground Clutter Return in Airborne Radar Equipment, J.I.E.E.
99 (IV), 92 (1952).

[1.10] B.M. Dwork, Detection of a Pulse Superimposed on Fluctuation Noise, Proc. I.R.E. 38.

[1.11] H. Urkowitz, Filter for Detection of Small Radar Signals in Clutter, J. Appl. Phys. 24, 1024.

[1.12] L. Zadeh, Optimum Nonlinear Filters for the Extraction and Detection of Signals, J. Appl.
Phys. 24, 396 (1953).

[1.13] D. Middleton, The Statistical Theory of Signal Detection, Trans. Prof. Group Info. Theory
PGIT-3, March (1954).

[1.14] U. Grenander, Stochastic Processes and Statistical Inference, Arkiv Mat. (Stockholm) 1,
195 (1950).

[1.15] J.L. Lawson and G.E. Uhlenbeck, "Threshold Signals," McGraw-Hill (New York) 1950,
Vol. 24, M.I.T. Rad. Lab. Series; Sec. (7.5).

[1.16] H. Hanse, Doctoral Dissertation (M.I.T.), Jan. 1951.

[1.17] E. Reich and P. Swerling, Jr., Detection of a Sine Wave in Gaussian Noise, J. Appl. Phys.
24, 289 (1953).

[1.18] D. Middleton, Statistical Criteria for the Detection of Pulsed Carriers in Noise, I, II,
J. Appl. Phys. 24, 371, 379 (1953).

[1.19] P.M. Woodward and I.L. Davies, Information Theory and Inverse Probability in
Telecommunications, Proc. I.E.E. 99 (III), 37 (1952).

[1.20] I.L. Davies, On Determining the Presence of Signals in Noise, ibid. p. 45.

[1.21] D. Middleton, Statistical Theory of Reception I: Optimum Detection of Signals in Noise, paper
submitted to J. Appl. Phys. June, 1954.


[1.22] J.L. Hodges and E.L. Lehmann, Some Problems in Minimax Point Estimation. Ann. Math.
Stat. 21, 182 (1950).

[1.23] Wassily Hoeffding, Optimum Nonparametric Tests. Proc. 2nd Berkeley Symp., U. Cal. Press,
p. 83 (1950).

[1.24] E.L. Lehmann and C. Stein, On the Theory of Some Nonparametric Hypotheses. Ann. Math.
Stat. 20, 28 (1949).

[1.25] D. Middleton, The Statistical Theory of Detection I: Optimum Detection of Signals in Noise.
M.I.T. Lincoln Lab. Technical Report No.

[1.26] A.J.F. Siegert, Passage of Stationary Processes Through Linear and Nonlinear Devices.
Trans. I.R.E. PGIT-3, 4.

[1.27] R.C. Davis, The Detectability of Random Signals in the Presence of Noise. Trans. I.R.E.;
also J. Appl. Phys.

[2.1] A. Wald, "Statistical Decision Functions," John Wiley (New York) 1950.

[2.2] D. Blackwell and M.A. Girshick, "Theory of Games and Statistical Decisions," John Wiley
(New York) (1954).

[2.3] E.W. Sampson, Fundamental Natural Concepts of Information Theory, AFCRC Report No. E
5079, Oct. 1951, Sec. 14.

[2.4] C.E. Shannon, Mathematical Theory of Communication, Bell Sys. T. J. 27, 379, 623 (1948).

[2.5] See, for example, J. Neyman, "Lectures and Conferences on Math. Stat. and Probability,"
2nd Ed., Graduate School, U.S.D.A., Washington, 1950, p. 194.

[2.6] A. Wald, Ref. [2.1], Sec. (1.6).

[2.7] J.L. Hodges and E.L. Lehmann, The Use of Previous Experiences in Reaching Statistical
Decisions. Ann. Math. Stat.

[2.8] A. Wald, Ref. [2.1], Theorem (3.20) and ensuing remarks.

[2.9] J. Kiefer, Ann. Math. Stat. 24, 71 (1953) shows that Wald's restriction of D to the class of
decision functions for which r(S,δ) is a bounded function of S is unnecessary.

[2.10] A. Wald, op. cit. ref. [2.1]. Wald proved the theorem under less restrictive conditions,
but these, (A) - (D), are sufficient for our applications. His assumptions
(3.1)-(3.3), Chapter 3, are covered by (A) and (B), (3.5), (3.6) by (C), and
(3.4), (3.7) by (D).

[2.11] A. Wald, ref. [2.1]. Theorems (3.5), (3.7), (3.9), (3.14).

[2.12] For reference to the original papers of Fisher, see H. Cramér, "Mathematical Methods of
Statistics," Princeton (1947).

[2.13] P.R. Halmos and L.J. Savage, Applications of the Radon-Nikodym Theorem to the Theory of
Sufficient Statistics. Ann. Math. Stat. 20.

[2.14] R.A. Fisher, Proc. Camb. Phil. Soc. 22, 700 (1925). For useful discussions of the
properties of Fisher's information measure, see E.J.G. Pitman, Proc. Camb. Phil.
Soc. 32 (1936) and J.L. Hodges and E.L. Lehmann, Proc. Berkeley
Symposium, U. Cal. Press, 1951. J.L. Doob, Trans. Am. Math. Soc. 39,
410 (1936) discusses another information measure with similar properties.

[2.15] P.M. Woodward, Probability and Information Theory, with Applications to Radar,
McGraw-Hill (1953); Sec. (3.7) gives this result without mention of sufficient statistics.


[2.16] S. Kullback and R.A. Leibler, Ann. Math. Stat. 22, 79 (1951), in discussing hypothesis
testing, use the logarithm of the likelihood ratio as a measure of the information
contained in an observation, for discrimination between the two hypotheses,
showing that its average value is positive semidefinite and invariant under a
sufficient transformation. G.W. Preston, J. Appl. Phys. 24, 841 (1953) uses
the result (Eq. 2.16), without proof, to define optimum extraction.

[3.1] Kullback and Leibler, ref. [2.16]. The constant of proportionality involves the elements of
Fisher's information matrix.

[3.2] D. Middleton, ref. [1.18]. The relations (3.22), (3.23) here replace Eqs. (4.18), (4.19) of
that reference when one additional term is used in the approximation.

[3.3] D. Middleton, Further Remarks on the Nature of the Statistical Observer, J. Appl. Phys. 25,
127 (1954); also, D. Middleton, W.W. Peterson, and P.T. Birdsall, J. Appl.
Phys. 25, 128 (1954).

[4.1] L.A. Zadeh, General Filters for Separation of Signal and Noise. (Paper presented at
Brooklyn Symposium on Information Networks, April 1954) discusses similar
questions from a somewhat related viewpoint.

[4.2] J.L. Hodges and E.L. Lehmann, Some Problems in Minimax Point Estimation, Ann. Math.
Stat. 21, 182 (1950).

[4.3] H. Cramér, Mathematical Methods of Statistics, Princeton (1947), 33.2, pg. 499.

[4.4] A. Wald, Contributions to the Theory of Statistical Estimations and Testing Hypotheses. Ann.
Math. Stat.

[5.1] D. Middleton, Information Loss Attending the Decision Operation in Detection. J. Appl. Phys.
25, 127 (1954).


[Fig. 1. The decision situation: signal space, noise space, observation space and decision space, linked by the decision rule δ(γ|V).]

[Fig. 2. Normalized Bayes risk for binary threshold detection (pulsed carrier, Rayleigh statistics).]

[Fig. 3. Threshold detection characteristics, Bayes tests (pulsed carrier, Rayleigh statistics).]

[Fig. 4. Minimax and Bayes risk (pulsed carrier, Rayleigh statistics).]

[Fig. 5. Analogous relations for detection and extraction: error probabilities, conditional risk, and average risk.]

[Fig. 6. Loss of information with threshold detection (pulsed carrier, Rayleigh statistics); ordinate: equivocation (bits).]

[Fig. 7. Information loss criterion.]


A NON-LINEAR PREDICTION THEORY 


R. Drenick
Radio Corporation of America 
Camden, N. J. 


Introduction 


The theory discussed in this paper deals with the problem of the synthesis of certain filters
which either extract signals from noise, or extrapolate them, in some optimum fashion. Its main
purpose is a study of the conditions under which these optimum filters are non-linear, of the
procedures by which they can in principle be synthesized, and of how much better they can be expected
to perform than their linear counterparts.


A prediction theory, in the sense in which the word is used in the communications field, is 
usually specified by what assumptions are made in three areas: 


1. The nature of the signal.
2. The statistics of the noise.
3. The error criterion.


To these, it will be convenient to add here a fourth, the principle of data acquisition. The 
assumptions which characterize the present theory are explained in some detail in section (Ia) below. 
They are briefly the following: 


1. As regards the signal, it is assumed that it is representable, over the period of time of
interest, by a polynomial in time. The order of the polynomial is assumed known beforehand,
but its coefficients are assumed unknown. This assumption also underlies a prediction theory
due to R. B. Blackman, H. W. Bode, and C. E. Shannon (Ref. 1) which is intended for the design
of radar tracking filters.


2. The present theory is more general, for one, as regards the noise. Unlike its predecessor, it
is not limited to Gaussian noise but accommodates a very broad class of probability
distributions, presumably broad enough to cover most practical applications. It is particularly the
cases of non-Gaussian noise which are found usually to lead to non-linear prediction filters.


3. Greater generality can also be claimed with some justification for the assumption concerning
the error criterion. Part of the paper, namely that dealing with the general theory,
holds for a rather large variety of error criteria. Another part, however, is restricted to
the rms error criterion because explicit results can be derived most readily for this case.


4. As regards the method of data acquisition, the present theory is probably to be considered 
more restricted than many earlier ones: Unlike that of Blackman, Bode, and Shannon, for 
instance, which holds for continuous data acquisition, the present one holds (at least 
formally) only for discontinuous acquisition. The resulting filters are, accordingly, in 
the nature of sampling filters. 


Aside from the reference mentioned, the work reported in this paper has profited from two other
sources. One is a particularly elegant treatment of a statistical estimation problem by M.A. Girshick
and L.J. Savage (Ref. 3). In fact the general approach, and even the notation used here, borrows heavily
from these two authors. The second source is an unpublished memorandum by P. Nesbeda and the Author
(Ref. 2) in which very similar though somewhat less general results were arrived at by a different
method. This study was very helpful to the one reported here.


One other fact may be worth mentioning. The present theory is a part of statistical estimation
theory and as such is not too closely related to N. Wiener's prediction theory (Ref. 5). (The latter,
in a situation similar to the present one, could be obtained as a "Bayes" solution of the prediction
problem; see Ref. 6.)


On the basis of the assumptions listed above, the theory is developed in these steps: A functional
equation is first introduced which in fact defines what is meant by a predicting (or smoothing)
filter. Several preliminary concepts are next introduced which are connected with this prediction
concept, and needed in the later discussions. The equation is then introduced which characterizes
the predictor which is optimum relative to a given type of signal and a given error criterion. A
proof is given of its optimum performance.


The remainder of the paper is concerned with the conveniently simple rms error criterion. Linear
filters are established for the case of Gaussian noise, and a recurrence relation is shown to
exist among them. They are also used as starting points for the development of the non-linear
filters which result for non-Gaussian noise. The synthesis method which is obtained for these filters
is fairly straightforward and is programmed into a six-step procedure. The procedure is illustrated
by an especially simple example, namely, that of a linearly varying signal embedded in weakly
non-Gaussian noise. The improvement is calculated which is obtained from the non-linear filter over the
best linear filter. This improvement seems to be quite substantial.


Ia Assumptions 


As pointed out in the Introduction, it is necessary to specify the nature of the prediction
problem by assumptions in four areas, namely, the type of signal, the character of the noise, the error
criterion, and the manner of data acquisition. It will be convenient to discuss these assumptions now
and, at the same time, to define some terminology which will be used in the remainder of the paper.


The pure signal, first of all, will be assumed to be in the form of a polynomial in time,

$$\bar{x}(t) = \theta_0 + \theta_1 t + \cdots + \theta_q t^q \qquad (0 \le t \le \tau). \tag{1.1}$$

This is to say, it is assumed that a record of the signal in the absence of noise, taken over a period
of at most τ seconds, can always be fitted with a q-order polynomial, with an error which is negligible
for the purpose under study. This is the assumption which underlies the theory of Blackman, Bode, and
Shannon (Ref. 1).


It will become evident below that the design of the predicting filter depends very strongly on what
is assumed for the order of the polynomial (1.1). Accordingly, it will be convenient to speak of a
"q-order" predicting filter, or more briefly a "q-order predictor," when the order of the polynomial (1.1)
is q.


The noise, to state the second group of assumptions, will be assumed additive. That is to say,
the actually observed signal x(t), contaminated with noise ε(t), will be of the form

$$x(t) = \bar{x}(t) + \epsilon(t).$$

The noise will be characterized by a multivariate probability density, say,

$$f(\epsilon_0, \epsilon_1, \ldots, \epsilon_n) = f(x_0 - \bar{x}_0,\; x_1 - \bar{x}_1,\; \ldots,\; x_n - \bar{x}_n), \tag{1.2}$$


describing the joint distribution of (n+1) noise samples taken at (n+1) different times. Concerning f,
it will be assumed that it can be expanded into a multivariate Gram-Charlier series of type A. This
constitutes a restriction on the generality of f which will be rarely important in practice. (It is,
in fact, unnecessarily narrow for most of the topics of this paper, but is useful in its last part.)


The next, and third set of assumptions deals with the error criterion. It is customary in this 
connection to prescribe the so-called rms error criterion. This means the following: A penalty is, 
in effect, introduced for an error in each prediction which varies as the square of that error. The 
mean value of this penalty is designated the "mean square error," and predicting filters rated
under this rms error criterion are ranked according to this mean square error. The optimum filter
is, accordingly, the one which minimizes it.


A generalization of this criterion suggests itself, and is, in fact, quite well known (Ref. 6).
For the purpose of this paper, the penalty $W$ will be assumed to be a function only of the
prediction error $\epsilon_p$:

$W = W(\epsilon_p)$    (1.3)

The error criterion, that is, the basis on which to rate predicting filters, will then be the mean
value of $W$. The penalty $W$ is traditionally called the "loss function". Its mean value is called
the "risk" and will be denoted

$r = E\{W(\epsilon_p)\}$    (1.4)

where the symbol $E$, as customary, stands for "the mean value of". In this nomenclature, then, the
criterion of performance of a predictor will be the risk to which it leads. The optimum predictor
will be the one which minimizes the risk.




In this paper, the loss function will be subject to the natural assumption of being smallest
when the prediction error is zero. In addition, it will be assumed convex and twice differentiable
at the origin. Thus, it includes as a special case the squared-error loss. In fact, one derives
directly from these assumptions that

$W'(\epsilon_p) = \epsilon_p V(\epsilon_p)$    (1.5)

where

$V(0) \ge 0$    (1.6)

The rms error criterion is, therefore, characterized by

$W(\epsilon_p) = \epsilon_p^2, \qquad V(\epsilon_p) = 2$    (1.7)

It will be convenient in what follows to disregard the possibility of the equality sign in (1.6).
This simplifies the statements which will be made without materially affecting their principle.


The fourth and final set of assumptions concerns the method of data acquisition. The usual
assumption, and also the one leading to the most elegant results, is that of continuous acquisition.
Unfortunately, this does not seem possible in the present case, at least if involvement in fairly
complicated non-linear functionals is to be avoided. It will, accordingly, be assumed here that data
are acquired in a discrete sequence. As a matter of convenience (but not of necessity), a second
assumption will be introduced, namely, that acquisition takes place at a uniform rate. That is to
say, the time interval $T$ between any two successive data points is the same. However, it is not
important how short this interval is. The rate of acquisition can, in principle, be taken as high
as may seem desirable.


Ib. Terminology 


Two time intervals have been introduced above: $\bar{t}$, the period over which (1.1) is an
adequate representation of the signal, and $T$, the time interval between data points. Hence, if

$\bar{t} = nT$    ($n$ = integer)

there will be $(n+1)$ data points $x_0, x_1, \dots, x_n$ available on which to make the predictions.
We shall assume that $x_0$ denotes the most recent of these points, and that the sequence has been
recorded, respectively, at the times

$t_j = -jT$    ($j = 0, 1, \dots, n$)

(This notation implies that the origin of time coincides with the most recent observation.) We
reserve the symbols $j$ and $k$, as far as possible, for integers which assume, as above, the values
from 0 to $n$.


Now, the pure signal is given by (1.1). If it had been sampled at the times $t_j$, the values

$\bar{x}_j = \theta_0 + \theta_1 (-jT) + \theta_2 (-jT)^2 + \dots + \theta_q (-jT)^q$

would have been obtained. The predicted value of the signal, that is, the value of $x$ at the time
$t_p$, would be, by (1.1),

$x_p = \theta_0 + \theta_1 t_p + \theta_2 t_p^2 + \dots + \theta_q t_p^q$

If $t_p$ is set equal to zero, the prediction problem reduces to the smoothing problem. One can,
therefore, establish as the relation between the desired prediction and the original values observed
in the absence of noise

$\bar{x}_j = x_p + \theta_1 (-jT - t_p) + \theta_2 ((-jT)^2 - t_p^2) + \dots + \theta_q ((-jT)^q - t_p^q)$    (1.8)

It will be convenient to put

$a_{ij} = (-jT)^i - t_p^i$    (1.9)

and write for (1.8)

$\bar{x}_j = x_p + \sum_{i=1}^{q} a_{ij}\theta_i$    (1.10)

The symbol $i$, unless specified otherwise, will be used consistently in this paper to denote an
integer which takes the values from 1 to $q$.


One further convention will be useful. We shall frequently have to deal with functions of all
$(n+1)$ values of $x$, their joint probability density being one of them. It will be convenient to
write, for instance,

$f(\epsilon_0, \epsilon_1, \dots, \epsilon_n) = f(\epsilon_j)_0^n$    (1.11)

In this notation, and using (1.10), one can re-write (1.2):

$f(\epsilon_j)_0^n = f(x_j - \bar{x}_j)_0^n = f(x_j - x_p - \sum_{i=1}^{q} a_{ij}\theta_i)_0^n$    (1.12)

The same notation will be convenient also in other functions of these $(n+1)$ variables.
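
The relation (1.8) between the noise-free samples and the value to be predicted can be checked numerically. The following is only a sketch: it assumes the coefficients $a_{ij} = (-jT)^i - t_p^i$ as read off (1.8), and the values of $\theta_i$, $n$, $T$ and $t_p$ are hypothetical.

```python
def poly(theta, t):
    """Evaluate theta_0 + theta_1*t + ... + theta_q*t**q."""
    return sum(th * t ** i for i, th in enumerate(theta))


def a_coeff(i, j, T, t_p):
    """Assumed coefficient a_ij = (-j*T)**i - t_p**i, as read off (1.8)."""
    return (-j * T) ** i - t_p ** i


theta = [0.5, -1.0, 0.25]      # hypothetical theta_0 .. theta_2, so q = 2
n, T, t_p = 4, 0.1, 0.3        # hypothetical record length, spacing, lead time
q = len(theta) - 1

x_p = poly(theta, t_p)                                # value to be predicted
pure = [poly(theta, -j * T) for j in range(n + 1)]    # samples at t_j = -jT
rebuilt = [
    x_p + sum(a_coeff(i, j, T, t_p) * theta[i] for i in range(1, q + 1))
    for j in range(n + 1)
]
max_gap = max(abs(u - w) for u, w in zip(pure, rebuilt))
print(max_gap)   # rounding error only if the assumed a_ij are consistent with (1.8)
```

Note that $\theta_0$ drops out of the difference between a sample and the predicted value, which is why the sum starts at $i = 1$.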


IIa. Definition and Characterization of Predictors


The term predictor, or predicting filter, as used in this paper, will be a device, subject to
certain restrictions, which accepts signal and noise as input, stores this input, and generates as
its output an estimate of the signal value at some future time. It will generate this output from
the input data $x_j$, past and present, according to some formula which will here be called the
predictor formula or, when confusion with the physical device is unimportant, briefly the predictor.
The predictor formula, then, will be in the nature of

$u = u(x_0, x_1, \dots, x_n) = u(x_j)_0^n$

(using again the notation of (1.11)).


As mentioned earlier, this formula will depend on the order $q$ of the polynomial which represents
the signal. This fact will be expressed by a subscript $q$. That is to say,

$u_q(x_j)_0^n$

stands for a predictor of order $q$. A second-order predictor $u_2$, according to this terminology,
will be designed to extrapolate signals which vary, at the most, quadratically with time. It will,
of course, do as well with signals which vary linearly with time or remain constant, since these are
only special cases of the signal for which it is designed. But it will be inadequate, for instance,
with any signal which varies as the cube of time, in the sense that the prediction error will not be
zero, even in the absence of noise ($\epsilon_p = u_q - x_p$).


These statements must now be made more precise. It is necessary, in other words, to specify 
what is meant by a predictor being "designed", or "adequate", for a q-order signal. 


To do this, we proceed with the following argument: Assume that a set of data points $x_j$ has been
observed, that they have been put into a predictor of order $q$, and that the output $u_q$ has been
obtained. Assume next that a second set $x_j'$ is formed from the first by

$x_j' = x_j + x_p + \sum_{i=1}^{q} a_{ij}\theta_i$    (2.1)

that is, by superposing on the first set another polynomial of order $q$. We shall require (as seems
natural) that, in this case, the predictor output should be correspondingly increased by $x_p$:

$u_q(x_j')_0^n = u_q(x_j + x_p + \sum_{i=1}^{q} a_{ij}\theta_i)_0^n = u_q(x_j)_0^n + x_p$    (2.2)

A filter satisfying this functional equation will be said to have the "predictor property," and it
will be prescribed for all predictors discussed in this paper. Equ. (2.2), while common to all
predictors (of the same order $q$), does not determine some specific predictor formula, for there
are many formulae which satisfy it, and some will be suggested presently.


Requirements similar to (2.2) are found in all smoothing and prediction theories, although
the reasons by which they are justified are not always the same. The one used here is parallel to
the argument first used, apparently, by Pitman (Ref. 9). Zadeh and Ragazzini (Ref. 6) derive it
from the specification that the means of input and output should be the same (a specification
which cannot always be used for error criteria other than the least-squares criterion).
Schoenberg (e.g., Ref. 10) uses the term "preserving power" for the condition analogous to (2.2).


The predictor property has an immediate consequence which is virtually equivalent to it. To
derive it, assume that the originally recorded set of data points $x_j$ has yielded all zeros. Then

$u_q(x_j')_0^n = u_q(x_p + \sum_{i=1}^{q} a_{ij}\theta_i)_0^n = u_q(0)_0^n + x_p$

For those predictors, therefore, which produce no output for zero input (and most will be of that
type), the property (2.2) prescribes that they should extrapolate exactly when no noise is present.


It may be worth noting also that the predictor property can be given another interpretation
with a rather mathematical flavor: One can, first of all, think of (2.1) as a linear transformation.
Equ. (2.2) then expresses a certain invariance property under that transformation which all
predictors must have. This is a convenient, if rather abstract, way of expressing the predictor
property and will be used repeatedly in what follows.


Finally, we state without proof a few facts which are chiefly of statistical interest. It can
be shown that the risk $r$ from any predictor having the property (2.2) is constant, that is,
independent of $x_p$ and $\theta_i$. In fact, $r$ can be obtained formally by calculating the mean
loss for vanishing $x_p$ and $\theta_i$. This will be expressed (following the notation of Girshick
and Savage) by writing $E_0$ for the mean in such a case instead of $E$ as in (1.4):

$r = E\{W(\epsilon_p)\} = E_0\{W(u_q)\}$    (2.3)

An optimum predictor, namely, the one that minimizes the risk, minimizes this constant. By some
well known theorems (e.g., Ref. 6), this makes the optimum predictor the minimax solution to the
prediction problem.


IIb. Linear Predictors


It may be useful to illustrate the above by the two examples presented below. It will be
noticed that both are linear predictors in the sense that the predictor formulae are linear
in the $x_j$. Formulae like these can be, and have been, derived for the case of continuous data
acquisition by the method of Ref. 1. Set 2, in particular, is the discontinuous analog of the
optimum filters obtained by that method for the case of white Gaussian noise.


Examples of predictor formulae are:


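SET 1:

$v_0(x_j)_0^n = x_0$

$v_1(x_j)_0^n = \alpha x_0 + \beta x_1, \qquad \alpha = 1 + \frac{t_p}{T}, \quad \beta = -\frac{t_p}{T}$

$v_2(x_j)_0^n = \alpha x_0 + \beta x_1 + \gamma x_2, \qquad \alpha = 1 + \frac{3}{2}\frac{t_p}{T} + \frac{1}{2}\frac{t_p^2}{T^2}, \quad \beta = -2\frac{t_p}{T} - \frac{t_p^2}{T^2}, \quad \gamma = \frac{1}{2}\frac{t_p}{T} + \frac{1}{2}\frac{t_p^2}{T^2}$

etc.

SET 2 (all sums taken over $j = 0, 1, \dots, n$):

$v_0(x_j)_0^n = \frac{1}{n+1} \sum x_j$

$v_1(x_j)_0^n = \frac{\begin{vmatrix} \sum x_j & \sum a_{1j} \\ \sum a_{1j} x_j & \sum a_{1j}^2 \end{vmatrix}}{\begin{vmatrix} n+1 & \sum a_{1j} \\ \sum a_{1j} & \sum a_{1j}^2 \end{vmatrix}}$

$v_2(x_j)_0^n = \frac{\begin{vmatrix} \sum x_j & \sum a_{1j} & \sum a_{2j} \\ \sum a_{1j} x_j & \sum a_{1j}^2 & \sum a_{1j} a_{2j} \\ \sum a_{2j} x_j & \sum a_{1j} a_{2j} & \sum a_{2j}^2 \end{vmatrix}}{\begin{vmatrix} n+1 & \sum a_{1j} & \sum a_{2j} \\ \sum a_{1j} & \sum a_{1j}^2 & \sum a_{1j} a_{2j} \\ \sum a_{2j} & \sum a_{1j} a_{2j} & \sum a_{2j}^2 \end{vmatrix}}$

etc.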


It is easy to convince oneself that these sets actually have the predictor property. A method by 
which these, and similar ones, can be derived will become clear later in this memo (Section III). The 
examples cited above are probably sufficient to facilitate their formal extension to higher orders 
than the second. 
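
The predictor property of the two sample sets can be verified numerically. The sketch below rests on two assumptions that are not spelled out at this point: the coefficients are taken as $a_{ij} = (-jT)^i - t_p^i$ (as read off (1.8)), and Set 2 is implemented as an ordinary least-squares polynomial fit whose constant term is the prediction. The function names and the data record are hypothetical.

```python
def a_row(i, n, T, t_p):
    """Assumed a_ij = (-j*T)**i - t_p**i for j = 0..n."""
    return [(-j * T) ** i - t_p ** i for j in range(n + 1)]


def solve(A, b):
    """Solve A y = b by Gauss-Jordan elimination with partial pivoting."""
    m = len(b)
    M = [list(row) + [rhs] for row, rhs in zip(A, b)]
    for c in range(m):
        p = max(range(c, m), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(m):
            if r != c:
                f = M[r][c] / M[c][c]
                M[r] = [u - f * w for u, w in zip(M[r], M[c])]
    return [M[r][m] / M[r][r] for r in range(m)]


def set2_predictor(x, q, T, t_p):
    """Least-squares fit x_j ~ c + sum_i a_ij*theta_i; the constant c is the prediction."""
    n = len(x) - 1
    basis = [[1.0] * (n + 1)] + [a_row(i, n, T, t_p) for i in range(1, q + 1)]
    gram = [[sum(u * w for u, w in zip(bi, bk)) for bk in basis] for bi in basis]
    rhs = [sum(u * w for u, w in zip(bi, x)) for bi in basis]
    return solve(gram, rhs)[0]


def set1_first_order(x, T, t_p):
    """Straight line through the two most recent samples x_0 and x_1."""
    return (1 + t_p / T) * x[0] - (t_p / T) * x[1]


T, t_p = 0.1, 0.25
x = [0.3, -0.1, 0.4, 0.0, 0.2, -0.3]      # hypothetical noisy record, x[0] most recent
n = len(x) - 1
x_p, theta = 1.7, [0.8, -2.0]             # arbitrary superposed polynomial, as in (2.1)
a1, a2 = a_row(1, n, T, t_p), a_row(2, n, T, t_p)

shifted_quad = [xj + x_p + theta[0] * u + theta[1] * w for xj, u, w in zip(x, a1, a2)]
gain_set2 = set2_predictor(shifted_quad, 2, T, t_p) - set2_predictor(x, 2, T, t_p)

shifted_lin = [xj + x_p + theta[0] * u for xj, u in zip(x, a1)]
gain_set1 = set1_first_order(shifted_lin, T, t_p) - set1_first_order(x, T, t_p)
print(gain_set2, gain_set1)   # both outputs should increase by x_p, up to rounding
```

Solving the normal equations is numerically equivalent to evaluating the determinant ratios of Set 2 by Cramer's rule; elimination is used here only because it extends to any order without writing out the determinants.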


Linear predictors such as these form a fairly large class which will play an important role in the
theory below. They will, in fact, be used so often as to make a special notation for them advisable.
In this paper, a linear predictor of order $q$ will be denoted by $v_q$, an arbitrary predictor by $u_q$.


One further item: It will be important to derive an expression for the output of a $q$-order
predictor, particularly a linear one, when its input is a noise-free polynomial of an order $r$
higher than $q$. Assume, more specifically, that a given predictor is linear and of order $q$,
$v_q$, and that the input is of the form

$x_j = x_p + \sum_{i=1}^{r} a_{ij}\theta_i \qquad (r > q)$

Because of its linearity, the predictor will produce an output which is a linear superposition of
several portions. First, there will be the response to the input portion

$x_p + \sum_{i=1}^{q} a_{ij}\theta_i$

which, due to the predictor property of $v_q$, will be

$v_q(x_p + \sum_{i=1}^{q} a_{ij}\theta_i)_0^n = v_q(0)_0^n + x_p$

Secondly, there will be a group of terms in the output resulting from the input portion

$\sum_{i=q+1}^{r} a_{ij}\theta_i$

These will be of the form

$v_q(\sum_{i=q+1}^{r} a_{ij}\theta_i)_0^n = \sum_{i=q+1}^{r} \theta_i v_q(a_{ij})_0^n$

again because of the linearity of the device. The output of a $q$-order predictor with an input
of order $r$ greater than $q$ will, accordingly, be

$v_q(x_p + \sum_{i=1}^{r} a_{ij}\theta_i)_0^n = v_q(0)_0^n + x_p + \sum_{i=q+1}^{r} \theta_i v_q(a_{ij})_0^n$

or, slightly more generally,

$v_q(x_j + x_p + \sum_{i=1}^{r} a_{ij}\theta_i)_0^n = v_q(x_j)_0^n + x_p + \sum_{i=q+1}^{r} \theta_i v_q(a_{ij})_0^n$    (2.4)


This relation will be used presently.




IIb. The Auxiliary Quantities $z_j$


The optimum predictors which are about to be developed will be seen to involve the data points
$x_j$ only in certain combinations, with quite specific properties. These combinations will be
denoted by $z_j$, in accordance with the notation introduced by Girshick and Savage (Ref. 3).


The quantities $z_j$ can be constructed promptly if one already knows a set of predictors of all
orders up to and including the order $q$ of interest. Let it, therefore, be assumed that such a set
is known and, more specifically, that it is linear. This set will be denoted by
$v_0, v_1, \dots, v_q$. It could, for instance, be one of the two examples listed above in
Section IIa. The quantities $z_j$ are then formed as linear combinations of $x_j$ and these $v_i$;
thus,

$z_j = x_j + \sum_{i=0}^{q} \gamma_{ij} v_i$    (2.5)

The coefficients $\gamma_{ij}$ are, however, not arbitrary. They are, more specifically, so
determined that the $z_j$ remain invariant under the transformation (2.1). The procedure by which
the $\gamma_{ij}$ can be determined is straightforward and will be carried out, as a matter of
convenience, for the case $q = 2$. The extension to higher orders will be quite obvious from that.


The coefficients $\gamma_{ij}$ in (2.5) can be determined directly from the requirement of
invariance under the above transformation, with $q = 2$. For one must have

$z_j = x_j + \sum_{i=0}^{2} \gamma_{ij} v_i(x_k)_0^n = x_j + x_p + \sum_{i=1}^{2} a_{ij}\theta_i + \sum_{i=0}^{2} \gamma_{ij} v_i(x_k + x_p + \sum_{l=1}^{2} a_{lk}\theta_l)_0^n$

Using the predictor property of $v_0, v_1, v_2$, and observing (2.4), this leads to the following
system of $(3n+3)$ simultaneous equations for the $\gamma_{ij}$:

$\gamma_{0j} + \gamma_{1j} + \gamma_{2j} = -1$

$\gamma_{0j} v_0(a_{1k})_0^n = -a_{1j}$    (2.6)

$\gamma_{0j} v_0(a_{2k})_0^n + \gamma_{1j} v_1(a_{2k})_0^n = -a_{2j}$

This derivation of the coefficients $\gamma_{ij}$ for the case $q = 2$ is clearly and easily
extended to higher orders.


In the discussions which follow, it will sometimes be convenient to rearrange the $z_j$ into a
different form. If one takes into account the specific solutions obtained for the $\gamma_{ij}$ from
(2.6), one finds promptly that one can write,

For $q = 0$:  $z_j = x_j - v_0$

For $q = 1$:  $z_j = (x_j - v_1) - \frac{a_{1j}}{v_0(a_{1j})} [v_0 - v_1]$    (2.7)

For $q = 2$:  $z_j = (x_j - v_2) - \frac{a_{1j}}{v_0(a_{1j})} [(v_0 - v_2) - C_1 (v_1 - v_2)] - \frac{a_{2j}}{v_1(a_{2j})} [v_1 - v_2]$, where $C_1 = v_0(a_{2j}) / v_1(a_{2j})$

For $q = 3$:  $z_j = (x_j - v_3) - \frac{a_{1j}}{v_0(a_{1j})} [(v_0 - v_3) - C_1 (v_1 - v_3) - C_2 (v_2 - v_3)] - \frac{a_{2j}}{v_1(a_{2j})} [(v_1 - v_3) - C_3 (v_2 - v_3)] - \frac{a_{3j}}{v_2(a_{3j})} [v_2 - v_3]$

where

$C_1 = \frac{v_0(a_{2j})}{v_1(a_{2j})}, \qquad C_2 = \frac{v_0(a_{3j}) - C_1 v_1(a_{3j})}{v_2(a_{3j})}, \qquad C_3 = \frac{v_1(a_{3j})}{v_2(a_{3j})}$

etc.


The predictors $v_0, v_1, v_2, v_3$ in these formulae stand, of course, for $v_0(x_j)_0^n$, $v_1(x_j)_0^n$, $v_2(x_j)_0^n$, $v_3(x_j)_0^n$.
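
The invariance of the $z_j$ under the transformation (2.1) can be checked for the first-order case. This is a sketch under stated assumptions: the first-order line of (2.7) is taken as $z_j = (x_j - v_1) - [a_{1j}/v_0(a_{1j})](v_0 - v_1)$, the linear predictors are those of sample Set 2, $a_{1j} = -jT - t_p$, and all numerical values are hypothetical.

```python
def a1_row(n, T, t_p):
    """Assumed first-order coefficients a_1j = -j*T - t_p for j = 0..n."""
    return [-j * T - t_p for j in range(n + 1)]


def v0(x):
    """Zero-order predictor of Set 2: the plain average of the record."""
    return sum(x) / len(x)


def v1(x, a):
    """First-order predictor of Set 2, written as a ratio of 2x2 determinants."""
    m = len(x)
    sx, sa = sum(x), sum(a)
    saa = sum(u * u for u in a)
    sax = sum(u * w for u, w in zip(a, x))
    return (sx * saa - sa * sax) / (m * saa - sa * sa)


def z_first_order(x, a):
    """z_j = (x_j - v1) - a_1j / v0(a_1j) * (v0 - v1), the q = 1 case of (2.7)."""
    p0, p1 = v0(x), v1(x, a)
    response = v0(a)               # zero-order predictor applied to the sequence a_1j
    return [xj - p1 - aj / response * (p0 - p1) for xj, aj in zip(x, a)]


T, t_p = 0.1, 0.25
x = [0.3, -0.1, 0.4, 0.0, 0.2, -0.3]       # hypothetical record
a = a1_row(len(x) - 1, T, t_p)
x_p, theta_1 = -2.4, 5.0                   # arbitrary constants of the transformation (2.1)
x_shifted = [xj + x_p + theta_1 * aj for xj, aj in zip(x, a)]

z = z_first_order(x, a)
z_shifted = z_first_order(x_shifted, a)
drift = max(abs(u - w) for u, w in zip(z, z_shifted))
print(drift)   # rounding error only: the z_j do not see the superposed polynomial
```

With Set 2 as the starting set, these $z_j$ coincide with the residuals of a least-squares straight-line fit, which is one way to see why a superposed first-order polynomial leaves them unchanged.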


IIc. The Optimum Predictor 


In this section, the equation will be introduced which determines the optimum predictor of order
$q$ for a given loss function. This is a predictor $u_q^*$ which minimizes the risk (2.3) (asterisks
will henceforth be used to denote optimum predictors):

$r = E_0\{W(u_q^*)\} = \min E_0\{W(u_q)\}$

It may be worth repeating that the loss functions which are admitted here, following the discussion
in Section Ia, are of the form

$W'(y) = y V(y), \qquad V(y) \ge 0$

The equation which determines the optimum predictor will be seen to be an implicit one. It is
assumed in deriving it that some complete set of linear predictors, up to and including one of the
order $q$, $v_0, v_1, \dots, v_q$, is already known. Either of the two sets in Section IIb, for
instance, could be used (but would not usually be the most appropriate). The optimum predictor
$u_q^*$ can then be derived formally from the known one by adding a correction term to the $q$-order
linear predictor $v_q$:

$u_q^* = v_q + \Delta u$    (2.8)

The equation under discussion is an implicit equation for this correction term $\Delta u$. In fact,
it will be shown presently that $\Delta u$ is determined by

$\Delta u = -\frac{E_0[v_q V(v_q + \Delta u) \mid z_j]}{E_0[V(v_q + \Delta u) \mid z_j]}$    (2.9)

Here, $E_0[V(v_q + \Delta u) \mid z_j]$ stands for the conditional expectation of $V(v_q + \Delta u)$
for given $z_j$. The $z_j$ are the quantities introduced in the preceding section, and formed using
the known $v_0, v_1, \dots, v_q$. The symbol $E_0$ has been introduced in Section IIa as the mean
value of a function for $x_p = \theta_1 = \theta_2 = \dots = \theta_q = 0$.


Equations (2.8), (2.9) are the main results of the present paper. A corresponding equation for 
a squared-error loss function, and for what in the present terminology would be called a zero-order 
predictor, is contained in the paper by Girshick and Savage (Ref. 3). 


It must now be proven that equation (2.9) does indeed yield a correction term which turns $v_q$
into the optimum. This proof will proceed in three steps. It will first be shown that $u_q^*$ is
again a $q$-order predictor. The second step will establish an auxiliary fact, namely, that

$E_0\{W'(u_q^*)\} = 0$    (2.10)

The third step will finally prove that $u_q^*$ is the optimum, that is, that it minimizes the risk $r$.

The first step is very direct. $\Delta u$, by equation (2.9), is clearly a function of only the
$z_j$, which have been constructed to be invariant under the transformation (2.1). $v_q$ has the
predictor property under the same transformation. Hence, $u_q^*$ has the predictor property.


To establish (2.10), note that, by (1.5) and (1.6),

$E_0\{W'(u_q^*)\} = E_0 E_0[W'(v_q + \Delta u) \mid z_j] = E_0 E_0[(v_q + \Delta u) V(v_q + \Delta u) \mid z_j] = E_0\{E_0[v_q V(v_q + \Delta u) \mid z_j] + \Delta u \, E_0[V(v_q + \Delta u) \mid z_j]\}$

In these expressions, the symbol $E_0 E_0[\,\cdot \mid z_j]$ indicates that the mean value is first
taken conditionally, for given $z_j$, and then over the $z_j$. The term contained in the exterior
braces of the last expression is zero, by (2.9). This establishes (2.10).

The last step in the process is the proof of the optimum character of $u_q^*$. This proof is
carried out by showing that the risk involved in using any predictor $u_q$ other than $u_q^*$ cannot
be smaller than the risk due to $u_q^*$. This is seen in the following way:

$r(u_q) = E_0\{W(u_q)\}$

$= E_0\{W(u_q^*) + (u_q - u_q^*) W'(u_q^*) + \tfrac{1}{2}(u_q - u_q^*)^2 W''(\vartheta u_q)\}$

$= r(u_q^*) + E_0\{(u_q - u_q^*) W'(u_q^*)\} + \tfrac{1}{2} E_0\{(u_q - u_q^*)^2 W''(\vartheta u_q)\}$

$= r(u_q^*) + E_0 E_0[(u_q - u_q^*) W'(u_q^*) \mid z_j] + \tfrac{1}{2} E_0 E_0[(u_q - u_q^*)^2 W''(\vartheta u_q) \mid z_j]$

where $\vartheta u_q$ denotes an intermediate value between $u_q^*$ and $u_q$. Now, $u_q - u_q^*$
depends only on the $z_j$ because it is invariant under the transformation (2.1). Therefore,

$r(u_q) = r(u_q^*) + E_0\{(u_q - u_q^*) E_0[W'(u_q^*) \mid z_j]\} + \tfrac{1}{2} E_0\{(u_q - u_q^*)^2 E_0[W''(\vartheta u_q) \mid z_j]\}$

The second of the right-hand terms vanishes because of (2.10), and the third is non-negative because
of the convexity of $W(y)$. Hence, the risk $r(u_q)$ with the arbitrary predictor $u_q$,

$r(u_q) = r(u_q^*) + \tfrac{1}{2} E_0\{(u_q - u_q^*)^2 E_0[W''(\vartheta u_q) \mid z_j]\}$    (2.11)

is certainly no smaller than that obtained with $u_q^*$, and equality prevails only in the unusual
cases in which

$E_0\{(u_q - u_q^*)^2\} = 0$

Thus, $u_q^*$ is indeed an optimum predictor.

Equation (2.9) determining this optimum predictor will often be difficult to mechanize. The
cases, for instance, in which $W(y)$ is expressible by a polynomial of higher order than the second
lead to algebraic equations for $\Delta u$. This is, fortunately, not true for the case of
least-square prediction. In this case the equation for $\Delta u$ is linear and it can be solved
explicitly. The remainder of this paper will, accordingly, deal with that specific case.


III. Least-Square Error Prediction


IIIa. Specialization to the RMS Error Criterion


The specialization of the preceding theory to the rms error criterion can be achieved readily. In
fact, since $V(y)$ is a constant in this case, by (1.7), equation (2.9) reduces to

$u_q^* = v_q - E_0(v_q \mid z_j)$    (3.1)

This shows that, in the rms-error case, the optimum predictor is determined by an explicit equation.


It will be useful later on to have the equivalent to equation (2.11) which established $u_q^*$ as
the optimum. It is

$E_0\{u_q^2\} - E_0\{u_q^{*2}\} = E_0\{(u_q - u_q^*)^2\}$    (3.2)

It has already become apparent in the theory so far that linear predictors play an important role.
This, and certain other features, recommend them for an early discussion. The next section is, in
fact, devoted to them.




IIIb. Optimum Linear Predictors


It has been known for some time that the optimum predictor is linear if the probability density
of the noise is Gaussian and the loss function is the squared error (Ref. 8). A proof that the same
set of linear predictors is also optimum for a large class of symmetric loss functions was recently
given by P. Nesbeda and the writer (Ref. 7). Optimum linear predictors can be derived rather
expeditiously by the present theory. It is found, more specifically, that one can use a recursion
formula to proceed from zero-order to higher-order linear predictors. This, and their usefulness in
the present theory, suggests their derivation here.


Again, as a matter of convenience, the derivation will be developed for limited order $q$, that is,
for $q = 0$ and $q = 1$. The extension to predictors of other orders will be quite plain.


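By (3.1), any optimum predictor must satisfy

$E_0(u_q^* \mid z_j) = 0$

In the case under consideration, this condition can be written (using again $v$'s to denote linear,
and asterisks to denote optimum, predictors)

$\int \cdots \int v_q^* \, dv_0^* \, dv_1^* \cdots dv_q^* \; \phi(z_j - \sum_{i=0}^{q} \gamma_{ij} v_i^*)_0^n = 0$    (3.3)

where $\phi(y_j)_0^n$ stands for the $(n+1)$-dimensional Gaussian

$\phi(y_j)_0^n = (2\pi)^{-(n+1)/2} \Lambda^{-1/2} \exp\left(-\frac{1}{2\Lambda} \sum_{j,k=0}^{n} \Lambda_{jk} y_j y_k\right)$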
2A j,K=0O 


Assume first the order $q = 0$. Equation (3.3) can then be rewritten in the form

$0 = \int v_0^* \, \phi(z_j + v_0^*)_0^n \, dv_0^* \propto \exp\left(-\frac{1}{2\Lambda} \sum_{j,k} \Lambda_{jk} z_j z_k\right) \int v_0^* \exp\left\{-\frac{1}{2\Lambda}\left[2 v_0^* \sum_{j,k} \Lambda_{jk} z_j + v_0^{*2} \sum_{j,k} \Lambda_{jk}\right]\right\} dv_0^*$    (3.4)

It is easy to convince oneself, by any number of equivalent arguments, that (3.4) can hold only if

$\sum_{j,k=0}^{n} \Lambda_{jk} z_j = 0$    (3.5)

This relation will be useful in this form later on (Section III). The optimum linear predictor of
order zero is, in fact, a direct consequence of (3.5). By putting in it, from (2.5),

$z_j = x_j - v_0^*$

one has

$v_0^* = \frac{\sum_{j,k} \Lambda_{jk} x_j}{\sum_{j,k} \Lambda_{jk}}$    (3.6)

for the optimum linear predictor of order zero.
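
The zero-order formula (3.6) is a weighted average of the data, and it can be tried out on a small example. The sketch below assumes that the $\Lambda_{jk}$ are proportional to the elements of the inverse of the noise covariance matrix (any common factor cancels in the ratio), and it uses a hypothetical covariance $R_{jk} = \rho^{|j-k|}$ purely for illustration.

```python
def solve(A, b):
    """Solve A y = b by Gauss-Jordan elimination with partial pivoting."""
    m = len(b)
    M = [list(row) + [rhs] for row, rhs in zip(A, b)]
    for c in range(m):
        p = max(range(c, m), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(m):
            if r != c:
                f = M[r][c] / M[c][c]
                M[r] = [u - f * w for u, w in zip(M[r], M[c])]
    return [M[r][m] / M[r][r] for r in range(m)]


def zero_order_weights(R):
    """Weights w_j = sum_k L_jk / sum_jk L_jk with L the inverse of R (assumed form of (3.6))."""
    ones = [1.0] * len(R)
    row_sums = solve(R, ones)        # row sums of the inverse covariance matrix
    total = sum(row_sums)
    return [s / total for s in row_sums]


def output_variance(w, R):
    """Variance of the weighted sum of zero-mean noise samples with covariance R."""
    m = len(w)
    return sum(w[j] * R[j][k] * w[k] for j in range(m) for k in range(m))


rho, m = 0.6, 5                                   # hypothetical correlation and record length
R = [[rho ** abs(j - k) for k in range(m)] for j in range(m)]
w_opt = zero_order_weights(R)
w_mean = [1.0 / m] * m                            # plain average (Set 2, order zero)

x = [0.2, -0.4, 0.1, 0.5, -0.3]                   # hypothetical data record
v0_star = sum(wj * xj for wj, xj in zip(w_opt, x))
var_opt, var_mean = output_variance(w_opt, R), output_variance(w_mean, R)
print(v0_star, var_opt, var_mean)
```

The weights sum to one, which is the zero-order predictor property, and for correlated noise the resulting mean square error is no larger than that of the plain average.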


The procedure is entirely parallel for the corresponding first-order predictor. The integral
in (3.3) is then best written in the form

$0 = \int\!\!\int v_1^* \, \phi\left(z_j + v_1^* + \frac{a_{1j}}{v_0^*(a_{1j})} (v_0^* - v_1^*)\right)_0^n dv_0^* \, dv_1^*$

that is, using (2.7). If the same procedure as above is applied to the present integral, one obtains
two equations which are necessary for the vanishing of this integral. One of these is again (3.5).
The other, namely,

$\sum_{j,k=0}^{n} \Lambda_{jk} a_{1k} z_j = 0$    (3.7)




is the relation among the $z_j$ which, for first-order predictors, accompanies (3.5). The predictor
formula itself is promptly derived from this by replacing $z_j$ with $x_j$, using, for instance,
equation (2.5). One obtains

$v_1^* = \frac{v_0^* \sum_{j,k} \Lambda_{jk} a_{1j} a_{1k} - v_0^*(a_{1j}) \sum_{j,k} \Lambda_{jk} a_{1k} x_j}{\sum_{j,k} \Lambda_{jk} a_{1j} a_{1k} - v_0^*(a_{1j}) \sum_{j,k} \Lambda_{jk} a_{1k}}$    (3.8)

as a recursion formula for the optimum linear predictor of first order. The predictor $v_0^*$ could
be replaced in (3.8) with the expression (3.6).


The extension of this type of predictor formula to predictors of arbitrary order is quite clear.
One must have


CO SDP y\, 


diate Naan &; je Voyage he eae, 22 Nie Vey tak (3.9) 


Ugo PK Vag 


This is the general recursion formula for an optimum linear predictor of arbitrary order, expressed in 
terms of those of lower orders.


These expressions become especially simple in those cases in which successive observations are
statistically independent, and the noise is stationary. Then $\Lambda_{jk}$ is proportional to the
Kronecker delta $\delta_{jk}$, and the resulting predictor formulae are those listed earlier as
sample Set 2 in Section IIa.
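
Conditions (3.5) and (3.7) say that the $z_j$ are orthogonal, in the metric $\Lambda_{jk}$, to a constant sequence and to the sequence $a_{1k}$. A weighted least-squares straight-line fit produces residuals with exactly these properties, which the sketch below checks. It assumes that $\Lambda_{jk}$ is proportional to the inverse noise covariance and that $a_{1j} = -jT - t_p$; the covariance and the data are hypothetical.

```python
def solve(A, b):
    """Solve A y = b by Gauss-Jordan elimination with partial pivoting."""
    m = len(b)
    M = [list(row) + [rhs] for row, rhs in zip(A, b)]
    for c in range(m):
        p = max(range(c, m), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(m):
            if r != c:
                f = M[r][c] / M[c][c]
                M[r] = [u - f * w for u, w in zip(M[r], M[c])]
    return [M[r][m] / M[r][r] for r in range(m)]


def dot(u, w):
    return sum(p * q for p, q in zip(u, w))


def first_order_fit(x, a, R):
    """Weighted least-squares fit x_j ~ c + slope*a_1j with weight matrix inverse(R)."""
    ones = [1.0] * len(x)
    s = solve(R, ones)      # s_j = sum_k L_jk          (L = inverse of R)
    g = solve(R, a)         # g_j = sum_k L_jk * a_1k
    c, slope = solve([[dot(s, ones), dot(s, a)], [dot(g, ones), dot(g, a)]],
                     [dot(s, x), dot(g, x)])
    z = [xj - c - slope * aj for xj, aj in zip(x, a)]
    return c, z, s, g


T, t_p, rho = 0.1, 0.25, 0.5
x = [0.3, -0.1, 0.4, 0.0, 0.2, -0.3]                 # hypothetical record
m = len(x)
a = [-j * T - t_p for j in range(m)]
R = [[rho ** abs(j - k) for k in range(m)] for j in range(m)]

v1_star, z, s, g = first_order_fit(x, a, R)
cond_35 = dot(s, z)     # sum over j,k of L_jk * z_j
cond_37 = dot(g, z)     # sum over j,k of L_jk * a_1k * z_j

# Independent samples: the same fit with a unit covariance matrix.
unit = [[1.0 if j == k else 0.0 for k in range(m)] for j in range(m)]
v1_white = first_order_fit(x, a, unit)[0]
print(cond_35, cond_37, v1_star, v1_white)
```

For independent samples the constant term of the fit reduces to the determinant ratio of sample Set 2, in line with the remark above.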


IIIc. Optimum Non-Linear Predictors 


When the probability distribution $f(x_j - x_p - \sum_{i=1}^{q} a_{ij}\theta_i)_0^n$ of the noise is
not Gaussian, the optimum rms predictors are found to be, in general, non-linear. This will be shown
in the present section. It will develop, more particularly, that these predictors can be obtained by
a fairly simple procedure in many cases which are likely to be of practical interest. These cases
are characterized by the fact that the noise distributions do not differ too radically from the
Gaussian. Under such conditions, one can express the actual probability density of the noise as an
n-dimensional Gram-Charlier series (of type A), and have reason to hope that only a few terms of
this series will be necessary for an adequate representation.


An n-dimensional Gram-Charlier series is one in which the given probability density of the noise
is developed into a series starting with a well chosen Gaussian,
$\phi(x_j - x_p - \sum_{i=1}^{q} a_{ij}\theta_i)_0^n$, and continuing with the derivatives of that
Gaussian with respect to all of its variables $x_0, x_1, \dots, x_n$. A "well chosen" Gaussian is
one whose first- and second-order moments (i.e., means, dispersions and correlations) agree with
those of the given distribution $f(x_j - x_p - \sum_{i=1}^{q} a_{ij}\theta_i)_0^n$. An arbitrary
term from such a series could be of the form

$\frac{\alpha(\nu_0, \nu_1, \dots, \nu_n)}{\nu_0! \, \nu_1! \cdots \nu_n!} \frac{\partial^{\nu_0 + \nu_1 + \dots + \nu_n}}{\partial x_0^{\nu_0} \, \partial x_1^{\nu_1} \cdots \partial x_n^{\nu_n}} \phi(x_j - x_p - \sum_{i=1}^{q} a_{ij}\theta_i)_0^n$    (3.10)

where $\alpha(\nu_0, \nu_1, \dots, \nu_n)$ is a constant. If the Gaussian is in fact chosen as
recommended, no derivative of an order lower than the third will appear in the series, and if the
actual noise distribution is symmetric, the fourth-order derivatives will be the lowest (other than
the Gaussian itself, of course, which is the zeroth derivative).


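For brevity, the notation

$\frac{\alpha_\nu}{\nu!} \frac{\partial^\nu}{\partial x^\nu} \phi(x_j - x_p - \sum_{i=1}^{q} a_{ij}\theta_i)_0^n$    (3.11)

will be used for terms such as the one shown above. The Gram-Charlier series for the general noise
distribution can then be written symbolically

$f(x_j - x_p - \sum_{i=1}^{q} a_{ij}\theta_i)_0^n = \sum_\nu \frac{\alpha_\nu}{\nu!} \frac{\partial^\nu}{\partial x^\nu} \phi(x_j - x_p - \sum_{i=1}^{q} a_{ij}\theta_i)_0^n$    (3.12)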


The object of the present discussion is to derive from equation (3.1) optimum $q$-order predictors
for distributions of this type. To do this, as mentioned before, one must have a complete set of
linear predictors of all orders, from zero up to and including $q$. These predictors can be
arbitrarily chosen; for instance, either of the sets in Section IIa would serve for the purpose. The
necessary procedure is, however, the most expeditious if this starting set of predictors is well
chosen. Particularly well suited is, in this case, the "corresponding" set of optimum linear
predictors, that is, the set $v_0^*, v_1^*, \dots, v_q^*$ which would be optima if in the
Gram-Charlier series (3.12) all terms but the first were zero.


Let it be assumed in what follows that such a set has been determined as, indeed, it always could
be by the procedure outlined in the preceding section. The remaining task is then the calculation of
the conditional expectation in (3.1),

$E_0(v_q \mid z_j) = \frac{\int \cdots \int v_q \, dv_0 \, dv_1 \cdots dv_q \; f(z_j - \sum_{i=0}^{q} \gamma_{ij} v_i)_0^n}{\int \cdots \int dv_0 \, dv_1 \cdots dv_q \; f(z_j - \sum_{i=0}^{q} \gamma_{ij} v_i)_0^n} = \frac{N}{D}$    (3.13)

In this equation, the asterisks are not shown for the $v$'s to indicate that they are not optima for
the distribution (3.12).


In order to simplify the notation in what follows, the proof will be carried out for the special
case $q = 1$. As before, its extension to arbitrary $q$ will be quite evident.


Equation (3.13) shows that the optimum predictor will usually involve a fraction $N/D$, and it is
necessary to establish a procedure by which to calculate it. This procedure will be seen to be quite
simple, particularly so for the denominator $D$. The integral in $D$ can, using the series notation
(3.12), be written

$D = \sum_\nu \frac{\alpha_\nu}{\nu!} \int\!\!\int dv_0 \, dv_1 \, \frac{\partial^\nu}{\partial z^\nu} \phi\left(z_j + v_1 + \frac{a_{1j}}{v_0(a_{1j})} (v_0 - v_1)\right)_0^n$

Now, $v_0$ and $v_1$ are predictors which would be optimum for the linear problem. This can be made
use of, and the denominator transformed into

$D = K \sum_\nu \frac{\alpha_\nu}{\nu!} \frac{\partial^\nu}{\partial z^\nu} \phi(z_j)_0^n$

where $K$ is a proportionality constant of no relevance to the present problem.


The rule of formation of $D$ in the predictor formula is, accordingly, very simple and patently true
regardless of $q$. In the series expansion of the noise distribution $f(\epsilon_j)_0^n$, replace
$\epsilon_j$ by $z_j$. The result is

$D = K f(z_j)_0^n$    (3.14)


The rule for the numerator $N$ is only a trifle more complicated. The integral in $N$ is

$N = \sum_\nu \frac{\alpha_\nu}{\nu!} \int\!\!\int v_1 \, dv_0 \, dv_1 \, \frac{\partial^\nu}{\partial z^\nu} \phi\left(z_j + v_1 + \frac{a_{1j}}{v_0(a_{1j})} (v_0 - v_1)\right)_0^n$

One finds, by the same argument as for $D$,

$N = K \sum_\nu \frac{\alpha_\nu}{\nu!} \frac{\partial^{\nu-1}}{\partial z^{\nu-1}} \phi(z_j)_0^n$


The rule of formation of the numerator is, accordingly, the following. In the Gram-Charlier
expansion (3.12) of $f(x_j)_0^n$, omit the first term (the Gaussian proper), and in all others
reduce by one the order of the derivative with respect to one of the variables. Finally, replace
$x_j$ with $z_j$. This rule is evidently the same regardless of the order $q$ of the desired
predictor.


The conclusion is then reached that the formula for the optimum non-linear predictor of order
$q$ is

$u_q^* = v_q^* - \frac{\sum_\nu \frac{\alpha_\nu}{\nu!} \frac{\partial^{\nu-1}}{\partial z^{\nu-1}} \phi(z_j)_0^n}{f(z_j)_0^n}$    (3.16)


The procedure by which it can be synthesized can now be outlined in the following steps:


a) Develop the given probability density $f(x_j)_0^n$ of the noise into a generalized Gram-Charlier
(Type A) series.


b) First omit all terms from it but the first (which is a Gaussian) and derive a set of optimum
linear predictors for this Gaussian. For the derivation, use the recursion process which is
represented by equation (3.9). Find also the necessary quantities $v(a_{ij})$ introduced in (2.4).


c) Use this set to form the quantities $z_j$, by the procedure suggested in Section IIc, or by
equations (2.7).


d) Substitute these $z_j$ in place of the $x_j$ into the expression for the probability density
$f(x_j)_0^n$ of the noise. This is the denominator of the fraction in the desired predictor formulae.


e) Next, omit the first term of the Gram-Charlier series, and lower by one the derivatives with
respect to one (any one) of the variables in every other term. Substitute $z_j$ for the $x_j$.
This is the numerator of the fraction of the predictor formula.


f) Insert the fraction as the second right-hand-side term into (3.16). For the first term, use
the $q$-order linear predictor obtained under b).


IIId. Illustrative Derivation of a Simple Non-Linear Predictor


The process of establishing a non-linear predictor by this method is fairly straightforward, as
is evident from the outline in the preceding section. It will now be illustrated with a simple
example (which was also used in Ref. 2), namely, the determination of a certain first-order
predictor $u_1^*$.


Let it be assumed that the noise source is virtually white, or that the data acquisition is
relatively slow, so that successive noise samples can be considered independent. The joint
probability density of the noise $f(\epsilon_j)_0^n$ (equation (1.2)) can then be written as a
product

$f(\epsilon_j)_0^n = f_1(\epsilon_0) f_1(\epsilon_1) \cdots f_1(\epsilon_n)$

(Here, and in the discussion below, the subscript 1 will be used to point out a uni-variate
probability density.) Assume, furthermore, that $f_1(\epsilon_j)$ is symmetrical and weakly
non-Gaussian. That is,

$f_1(\epsilon_j) = \phi_1(\epsilon_j) + \frac{\alpha_4}{4!} \frac{d^4}{d\epsilon_j^4} \phi_1(\epsilon_j)$

where $\alpha_4$ is so small that its square can be neglected relative to unity. This is not as much
of a restriction as it might seem. In practice, it will often be possible to let $\alpha_4$ vary
over a range from -5 to +5 before the approximations to be used here become altogether indefensible.
The range of shapes of probability densities which can be produced by this variation in $\alpha_4$
is shown in Fig. 1.
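
The effect of the fourth-derivative term on the univariate density can be illustrated by computing its low-order moments numerically. The sketch below assumes that $\phi_1$ is the Gaussian with zero mean and unit variance, so that its fourth derivative equals $(\epsilon^4 - 6\epsilon^2 + 3)\phi_1(\epsilon)$; the value of $\alpha_4$ and the integration grid are hypothetical choices.

```python
import math


def phi1(e):
    """Gaussian density with zero mean and unit variance (an assumption of this sketch)."""
    return math.exp(-0.5 * e * e) / math.sqrt(2.0 * math.pi)


def he4(e):
    """Fourth Hermite polynomial; the fourth derivative of phi1 is he4(e) * phi1(e)."""
    return e ** 4 - 6.0 * e * e + 3.0


def f1(e, alpha4):
    """Weakly non-Gaussian density phi1 + (alpha4 / 4!) * d^4 phi1 / de^4."""
    return phi1(e) * (1.0 + alpha4 / 24.0 * he4(e))


def moment(k, alpha4, lo=-12.0, hi=12.0, steps=24000):
    """k-th moment of f1 by the midpoint rule on [lo, hi]."""
    h = (hi - lo) / steps
    total = 0.0
    for i in range(steps):
        e = lo + (i + 0.5) * h
        total += e ** k * f1(e, alpha4)
    return total * h


alpha4 = 0.5                       # hypothetical value
area = moment(0, alpha4)           # total probability
variance = moment(2, alpha4)       # second moment (the mean is zero by symmetry)
fourth = moment(4, alpha4)         # fourth moment
print(area, variance, fourth)
```

Under this assumption the added term leaves the area and the variance of the Gaussian untouched and shifts the fourth moment from 3 to $3 + \alpha_4$, so $\alpha_4$ plays the role of an excess-kurtosis parameter.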


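For small $\alpha_4$, then, one can write approximately

$f(\epsilon_j)_0^n \approx \prod_{k=0}^{n} \phi_1(\epsilon_k) + \frac{\alpha_4}{4!} \sum_{j=0}^{n} \frac{\partial^4}{\partial \epsilon_j^4} \prod_{k=0}^{n} \phi_1(\epsilon_k)$    (3.17)

This puts the probability distribution into the desired form of a Gram-Charlier series, and step (a)
in the procedure has been effected.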


The next step is to establish the optimum linear predictors of zero and first order, $v_0^*$ and
$v_1^*$. Since the successive noise samples are statistically independent, the remark at the end of
Section IIIb applies. According to this remark, the required optimum linear predictors are given by
Set 2 of Section IIa, that is,

$v_0^* = \frac{1}{n+1} \sum_{k=0}^{n} x_k, \qquad v_1^* = \frac{\begin{vmatrix} \sum x_k & \sum a_{1k} \\ \sum a_{1k} x_k & \sum a_{1k}^2 \end{vmatrix}}{\begin{vmatrix} n+1 & \sum a_{1k} \\ \sum a_{1k} & \sum a_{1k}^2 \end{vmatrix}}$

Also needed in this connection is the response of the zero-order filter to a linear input which, in
the present case, is

$v_0^*(a_{1j})_0^n = \frac{1}{n+1} \sum_{j=0}^{n} a_{1j}$
oO 


The third step in the procedure is to determine the quantities z;. They are, by equations (237) 


6) Det Keno ian Z 51 
Sf eee Com te a en Grell 
4 (44;)0 a $, p Ss; 59 
Here, $\delta$ is the Kronecker delta, and $S_1$, $S_2$ and $\rho$ stand for
n n ty 
Sq ae) Gt) 
Ke=I0 2d 
n Pn (2n+1) nt, os 
S5= S a? = (n+ 1) T? ———<$—$——— | 
K= 9 6 teh te 
é 2 
p =n-— 
sj 


This completes the preliminaries. 


The numerator N and denominator D for the fraction in $u_1$ are obtained by steps c) and d),
respectively, of the procedure. They are


1% n oe n a5 2 at n 
Ne a I $, (2x), Dei Ilo, (zx) 
4! j=0 dz? K=o0 YS dz4 K=o 
The desired non-linear predictor is, accordingly, 
a, oe 
ee len (ae) 
4! i dz? K a, 
ve : = SH, (z,) 
di eS Oe =v, +— H, z; 


Tees 2mm bel(zee) 
44 oz dK” * 


Here, use has been made of the smallness of $\alpha_4$ which renders negligible the second term in the
denominator. $H_3(z_j)$ is the third-order Hermite polynomial:

$$H_3(z_j) = z_j^3 - 3 z_j$$

where $z_j$ is the linear combination (3.18) of the inputs, and $v_1$ is the optimum linear predictor
quoted above.


The predictor (3.20) is very similar to, though slightly simpler than, one obtained earlier
(in Ref. 2). It is interesting to note in this connection that $u_1$ is not unique. That is to
say, there exist several expressions, all of which have the same optimum characteristics, namely,
have the same low rms error. Most of these involve the third-order Hermite polynomial, and
all of them have non-linear portions depending on $z_j$ only. Their algebraic structures, however,
differ somewhat.




(3.20) 


In fact, one such variant which is slightly simpler than (3.20) can be derived promptly. It
follows from equation (3.5) that the linear term in

$$\sum_{j=0}^{n} H_3(z_j) = \sum_{j=0}^{n} \left( z_j^3 - 3 z_j \right)$$


can be omitted and 
; ag Bs 
upauyt—% zy Ot) 
4! 7 


be written for the optimum predictor. An illustrative mechanization of the corresponding filter
is shown in fig. 2.


Evaluation of a Simple Non-Linear Predictor 


It is of interest to investigate the performance of a non-linear predictor such as the one
given by equation (3.21). That is to say, one might ask what is to be gained in the magnitude of
the rms error from using the optimum non-linear predictor $u_1$ as compared to an optimum linear one,
namely, $v_1$.


The best way, such as it is, of carrying out this comparison is by equation (3.2). According
to it, and to (3.21), the rms errors of the outputs of $u_1$ and $v_1$ are connected by


* Py On 7 ; n 3 2 
Bolu)? 8, (of) = l(a; - 071 =(—] Bol = 


Hence, to obtain the desired comparison, one must evaluate the integral
S32 3)2 4 
< erat =I 2 sd a $y} (x) dx, dxa° dx,, (3.22) 


where $z_j$ is given by (3.18). No method has so far been found by which this integration could be
carried out neatly and smoothly. This is apparently a rather common difficulty in statistics, and
the case under consideration is no exception, despite its simplicity. One can proceed a few steps
algebraically, at the price of mounting involvement, but ultimately numerical work must be resorted
to. This was done and led to the result shown in fig. 3. The quantity plotted there is, more
specifically,


Ac/e=VIE, (uj 2)-E, w*)/e (v2) 


It can be interpreted as the improvement in the least-square error which is achieved by using the
optimum non-linear predictor rather than the corresponding linear predictor. This improvement is
seen to be by no means negligible, even if the memory of the filter is limited to but a few data
points.


References

1) R.B. Blackman, H.W. Bode and C.E. Shannon, "Data Smoothing and Prediction in Fire-Control
   System", Tech. Report of Division 7, NDRC, Vol. 1 (1946).

2) R. Drenick and P. Nesbeda, "A Preliminary Study of Optimum Non-linear Prediction",
   unpublished memorandum, dated 4-30-53.

3) M.A. Girshick and L.J. Savage, "Bayes and Minimax Estimates for Quadratic Loss Functions",
   Second Berkeley Symposium on Mathematical Statistics and Probability, Univ. Calif. Press,
   1951, p. 53.

4) L. Zadeh and J.R. Ragazzini, "Extension of Wiener's Theory of Prediction", Jour. of
   Applied Phys. 21 (1950), p. 645.

5) N. Wiener, "Extrapolation and Interpolation of Stationary Time Series", John Wiley & Sons,
   1949.

6) A. Wald, "Statistical Decision Functions", John Wiley & Sons, 1950.

7) R. Drenick and P. Nesbeda, "A Class of Optimum Linear Predictors", Paper presented at the
   Washington meeting of Institute of Mathematical Statistics, Apr. 1953.

8) H.E. Singleton, "Theory of Nonlinear Transducers", Tech. Report No. 160, Research Lab. of
   Electronics, Mass. Inst. Tech., 8-12-50.

9) E.J.G. Pitman, "The Estimation of Location and Scale Parameters", Biometrika 30 (1939),
   p. 391.

10) I.J. Schoenberg, "On Smoothing Operations and Their Generating Functions", Bull. Am. Math.
    Soc. Vol. 59 (1953), p. 199.


EXAMPLES OF "WEAKLY" NON-GAUSSIAN DISTRIBUTIONS (σ = 1)

Fig. 1


ILLUSTRATIVE MECHANIZATION, 1st ORDER PREDICTOR, "WEAKLY"
NON-GAUSSIAN NOISE, PREDICTION TIME t = T
(C ... CUBING DEVICE)

Fig. 2


PERCENT IMPROVEMENT IN THE RMS ERROR
DUE TO TRACKING WITH A NONLINEAR FILTER

Fig. 3


THE DETECTION OF SIGNALS PERTURBED BY SCATTER AND NOISE (a)


Robert Price (b)


Student Member, IRE 
Research Laboratory of Electronics 
and 
Lincoln Laboratory 
Massachusetts Institute of Technology 
Cambridge, Massachusetts 


Introduction 


In recent years, much study has been devoted to the problem of conveying information efficiently
through channels in which the message-bearing waveforms may undergo distortion. Statistical methods
have proven an effective tool with which to analyze and synthesize transmission systems as a whole,
especially where channel perturbations are of a random nature. The statistical approach is
particularly rewarding when questions of receiver optimization are considered. Provided that the
transmitter and channel conform to the realistic, yet very general, model proposed by Shannon (Ref. 1),
the ideal receiver assumes the form of a probability-computer. This result was recognized by Woodward
and Davies (Ref. 2), and Van Vleck and Middleton (Ref. 3) have similarly treated ideal detection as the
testing of statistical hypotheses.


Random channel disturbances which have already received considerable attention have been mostly
of an additive nature: shot- and thermal-noise, atmospheric impulse-noise, and adjacent-channel
interference are familiar examples in this category. An excellent example of the application of
probability-computing methods is given in a paper by Reich and Swerling (Ref. 4), who have found the
functional form of the ideal receiver for the thermal- or shot-noise case. Recently, studies of VHF
"scatter" propagation beyond the line-of-sight have led to a channel model in which disturbances are
encountered which are no longer of a purely additive nature. As in the case of additive gaussian noise,
it has been found possible to obtain explicitly the functional form of the ideal, probability-computing
receiver for the "scatter" channel. The analysis is set forth in the three sections of this paper. In
Section A the mechanism by which the channel perturbs a transmitted signal is studied in detail, so
that the equivalent mathematical operations may be completely specified. Section B then applies the
results of Section A to the exact derivation of the functional form of the probability-computing
receiver. Finally, Section C discusses the simplification in receiver design which is made possible by
certain approximations valid at small signal-to-noise ratios.


A. Study of the Scatter Channel 


For VHF transmissions beyond the line-of-sight, Booker and Gordon (Ref. 5) have postulated a tropospheric
scattering model. Their hypothesis is that the transmitted wave impinges upon a great number of
randomly moving irregularities in the troposphere, so that the received wave is the resultant of many
small scattered waves. Booker, Ratcliffe and Shinn (Ref. 6) have proposed a similar model in connection
with ionospheric propagation. This paper does not propose to discuss the validity of this model, but will
accept it merely as a form of disturbed channel interesting in its own right, regardless of whether it
truly exists in nature or not.


(a) This paper is excerpted from "Statistical Theory Applied to Communication through Multipath
Disturbances", an Sc.D. thesis submitted to the Department of Electrical Engineering, Massachusetts
Institute of Technology, on 24 August 1953. The research reported in this paper was
supported jointly by the Army, Navy, and Air Force under contract with the Massachusetts
Institute of Technology.


(b) Presently at the Commonwealth Scientific and Industrial Research Organization, Sydney,
N.S.W., Australia.




The analysis proceeds on the basis of the particular scattering model considered by Rice (Ref. 7), in
which the randomly-moving irregularities are assumed all of equal size. Taking a sine wave as the
transmitted signal, $x_0(t) = \sin \omega_0 t$, it can be shown through Rayleigh's "random walk" analysis
(Ref. 8) and the central limit theorem that the received signal $z_0(t)$ has a gaussian distribution in
all dimensions. That is, $z_0(t)$ is statistically identical to filtered thermal noise having the same
power spectrum. Assuming that this spectrum $Y(\omega)$ is symmetric and narrow-banded relative to its
center frequency $\omega_0$, we may write, from Rice,

$$z_0(t) = y_s(t) \sin \omega_0 t + y_c(t) \cos \omega_0 t \qquad (1)$$

$y_s(t)$ and $y_c(t)$ are independent gaussian waveforms, both with autocorrelation $\phi(\tau)$ given by

$$\phi(\tau) = \int_0^\infty Y(\omega) \cos (\omega - \omega_0)\tau \; d\omega \qquad (2)$$
0 


If the sine wave is now narrow-band modulated, so that an information-bearing waveform x(t) is
transmitted,

$$x(t) = x_s(t) \sin \omega_0 t + x_c(t) \cos \omega_0 t \qquad (3)$$

The signal z(t), observed following scatter, is then

$$z(t) = z_s(t) \sin \omega_0 t + z_c(t) \cos \omega_0 t \qquad (4)$$

where

$$z_s(t) = x_s(t)\, y_s(t) - x_c(t)\, y_c(t) \qquad (5)$$

and

$$z_c(t) = x_s(t)\, y_c(t) + x_c(t)\, y_s(t) \qquad (6)$$

Thus scatter may be pictured as a complex multiplicative process.
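The complex-multiplication picture can be checked with a few lines of arithmetic. The sketch below is illustrative only. It assumes the sine and cosine components carry the subscripts s and c, treats the pair (sine part, cosine part) as (real, imaginary), and assumes that the channel responds to a cosine carrier with the same components shifted by ninety degrees. All numerical values are hypothetical.

```python
import math

def scatter(xs, xc, ys, yc):
    """Quadrature components after scatter, following the forms of (5) and (6)."""
    zs = xs * ys - xc * yc
    zc = xs * yc + xc * ys
    return zs, zc

# Hypothetical instantaneous values of the modulation and scatter components.
xs, xc, ys, yc = 0.8, -0.3, 1.1, 0.4
zs, zc = scatter(xs, xc, ys, yc)

# The same result from one complex product: (sine part) + i (cosine part).
z = complex(xs, xc) * complex(ys, yc)

# Cross-check against the carrier waveforms at an arbitrary instant.
w0, t = 2.0 * math.pi * 5.0, 0.0137
resp_to_sin = ys * math.sin(w0 * t) + yc * math.cos(w0 * t)   # response to sin(w0 t), as in (1)
resp_to_cos = ys * math.cos(w0 * t) - yc * math.sin(w0 * t)   # assumed response to cos(w0 t)
direct = xs * resp_to_sin + xc * resp_to_cos
via_components = zs * math.sin(w0 * t) + zc * math.cos(w0 * t)
```

Here `direct` and `via_components` agree to rounding error, and the real and imaginary parts of `z` equal `zs` and `zc`.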


To make the channel model as realistic as possible, we must not neglect the presence of additive
shot- and thermal-noise. We shall consider this gaussian noise n(t) to be localized at the receiver
input, and shall assume that its spectrum $N(\omega)$ is symmetric and narrow-band about the carrier
frequency $\omega_0$, although broad relative to the spectrum of z(t).

$$n(t) = n_s(t) \sin \omega_0 t + n_c(t) \cos \omega_0 t \qquad (7)$$

$n_s(t)$ and $n_c(t)$ are independent gaussian waveforms, both with autocorrelation $D(\tau)$ given by

$$D(\tau) = \int_0^\infty N(\omega) \cos (\omega - \omega_0)\tau \; d\omega \qquad (8)$$


Combining scatter and noise, the signal w(t) finally observed at the receiver is

$$w(t) = w_s(t) \sin \omega_0 t + w_c(t) \cos \omega_0 t \qquad (9)$$

where

$$w_s(t) = x_s(t)\, y_s(t) - x_c(t)\, y_c(t) + n_s(t) \qquad (10)$$

and

$$w_c(t) = x_s(t)\, y_c(t) + x_c(t)\, y_s(t) + n_c(t) \qquad (11)$$

Since $y_s(t)$, $y_c(t)$, $n_s(t)$ and $n_c(t)$ are independent gaussian waveforms, $w_s(t)$ and $w_c(t)$ share a
joint gaussian distribution, $p[w_s(t), w_c(t)/x_s(t), x_c(t)]$, assuming that x(t) is known a priori.




B. The Probability-Computing Receiver 


As mentioned in the introduction, the transmitter is assumed to conform to the Shannon model;
that is, it contains an information source which generates a sequence of symbols drawn randomly and
independently from a finite alphabet of size M. These symbols are encoded one-to-one into voltage
waveforms $x^{(k)}(t)$, k = 1, 2, ..., M, all of duration T and having the character of modulated carriers,
as specified in Section A. The receiver cannot know the $x^{(k)}(t)$ sequence a priori if any information is
to be conveyed, and the channel perturbations are such that observation of the received signal is not
sufficient to state with certainty which symbols were transmitted. Under these conditions, the best
which a receiver can do, within the mathematical framework of statistics, is to calculate
the conditional probabilities $P[x^{(k)}(t)/w(t)]$, k = 1, 2, ..., M, symbol-by-symbol.


Using Bayes' Theorem, and assuming that the a priori $P[x^{(k)}(t)]$ are all equal, we have

$$P[x^{(k)}(t)/w(t)] = K(w)\; p[w_s(t), w_c(t)/x^{(k)}(t)] \qquad (12)$$

where K(w) is constant for all k, for any given symbol interval. The probability computation is
generally simplified if $w_s(t)$ and $w_c(t)$ are conditionally independent, so that

$$p[w_s(t), w_c(t)/x^{(k)}(t)] = p[w_s(t)/x^{(k)}(t)]\; p[w_c(t)/x^{(k)}(t)] \qquad (13)$$

In general, the cross-correlation $\overline{w_s(t)\, w_c(s)}$ for a given $x^{(k)}$ is not zero, so that this cannot be.

In order to achieve conditional independence for general transmissions, it is necessary to seek two
new variables f(t) and g(t) through a linear transformation of $w_s(t)$ and $w_c(t)$. If we assume that the
noise has uniform spectral density in its bandwidth B, and that the waveforms f(t) and g(t) are
sampled only at intervals of 1/B, there exists a family of transformations which will give the desired
conditional independence. The key transformation involves $x^{(k)}(t)$ itself:


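$$f^{(k)}(t) = \frac{x_s^{(k)}(t)}{r^{(k)}(t)}\, w_s(t) + \frac{x_c^{(k)}(t)}{r^{(k)}(t)}\, w_c(t)$$

$$g^{(k)}(t) = \frac{x_s^{(k)}(t)}{r^{(k)}(t)}\, w_c(t) - \frac{x_c^{(k)}(t)}{r^{(k)}(t)}\, w_s(t) \qquad (14)$$

$$[r^{(k)}(t)]^2 = [x_s^{(k)}(t)]^2 + [x_c^{(k)}(t)]^2$$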


Any further "rotations" will leave conditional independence unaffected:
ee (4) @ as) + vg) 




We are now in a position to give detailed expressions which approximate the $p[f^{(k)}(t), g^{(k)}(t)/x^{(k)}(t)]$
from observation of $f^{(k)}(t)$ and $g^{(k)}(t)$ at their sampling points only. Letting the sampled
independent pairs $f_i^{(k)}, g_i^{(k)}$ or $f_i'^{(k)}, g_i'^{(k)}$ be denoted by $u_i^{(k)}$ and $v_i^{(k)}$, we have, from the
multiple-order gaussian distribution (Ref. 10),


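$$p[u_1^{(k)}, \ldots, u_n^{(k)}, v_1^{(k)}, \ldots, v_n^{(k)}/x^{(k)}(t)] = (2\pi)^{-n}\, |\mu^{(k)}|^{-1} \exp\left[ -\frac{1}{2|\mu^{(k)}|} \sum_{i=1}^{n} \sum_{j=1}^{n} \mu^{(k)ij} \left( u_i^{(k)} u_j^{(k)} + v_i^{(k)} v_j^{(k)} \right) \right] \qquad (16)$$

$\mu^{(k)ij}$ is the cofactor of $\mu_{ij}^{(k)} = \overline{u_i^{(k)} u_j^{(k)}} = \overline{v_i^{(k)} v_j^{(k)}}$ in the matrix $[\mu^{(k)}]$,
and $|\mu^{(k)}|$ is its determinant.

$$[\mu^{(k)}] = \begin{pmatrix} \mu_{11}^{(k)} & \mu_{12}^{(k)} & \cdots & \mu_{1n}^{(k)} \\ \mu_{21}^{(k)} & \mu_{22}^{(k)} & \cdots & \mu_{2n}^{(k)} \\ \vdots & & & \vdots \\ \mu_{n1}^{(k)} & \mu_{n2}^{(k)} & \cdots & \mu_{nn}^{(k)} \end{pmatrix} \qquad (17)$$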
We find,

$$\mu_{ij}^{(k)} = r_i^{(k)} r_j^{(k)} \phi_{ij} + N \delta_{ij} \qquad (18)$$

where $\phi_{ij} = \phi[(i-j)/B]$, N is the noise power, and $\delta_{ij}$ is the Kronecker delta-function: $\delta_{ij} = 1$, i = j;
$\delta_{ij} = 0$, i ≠ j. From (14), $r_i^{(k)}$ is the amplitude of the envelope of $x^{(k)}(t)$ at the i-th sampling point.

A simplification results if $x^{(k)}(t)$ is an amplitude-modulated carrier: $x_c^{(k)}(t) = C\, x_s^{(k)}(t)$. Then, from (14),


(k) | (x) 

(k) (kk) (k). (k) ay. Sa 
vy Bs sg a Le Be | “ei¥53 * Weis | rte ty (ao) 

! sj 


si 


Thus, in this case the transformation (14) is not necessary to achieve conditional independence.
The ±1 factor on the right side of (19) may easily be included in the $\mu^{(k)ij}$.




After the $p[u, v/x^{(k)}(t)]$ have been computed by (16), the direct probabilities $P[x^{(k)}(t)/w(t)]$
follow from Bayes' Theorem, as in (12). A schematic diagram of the overall operation of the
probability-computing receiver appears in Figure 1. The one approximation made in the preceding analysis
has been the neglect of possible information present in the waveforms at other than sampling points.
This difference, however, becomes negligibly small as the points are taken closer together by letting
the noise bandwidth B increase.
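The final step of the receiver, passing from the M conditional densities to the M symbol probabilities, amounts to a normalization when the a priori probabilities are equal. A minimal sketch follows. It works with logarithms of the densities to avoid underflow (a choice of this example, not of the paper), and the function name and numerical values are hypothetical.

```python
import math

def posteriors(log_likelihoods):
    """Equal-prior Bayes rule: P[k/w] = p[w/k] / sum over m of p[w/m].

    The common factor K(w) cancels in the normalization, so only the
    relative sizes of the M conditional densities matter.
    """
    peak = max(log_likelihoods)                       # subtract the peak for numerical stability
    weights = [math.exp(v - peak) for v in log_likelihoods]
    total = sum(weights)
    return [w / total for w in weights]

# Hypothetical values of log p[u, v / x^(k)(t)] for an alphabet of M = 3 waveforms.
log_p = [-12.0, -10.5, -15.2]
post = posteriors(log_p)
most_probable = max(range(len(post)), key=post.__getitem__)
```

The list `post` sums to one, and `most_probable` is the index of the waveform with the largest conditional density.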


C. Receiver Simplification at Small Signal-to-Noise Ratios


In general, the inversion of the high-order matrix $[\mu^{(k)}]$, required to obtain the $\mu^{(k)ij}$, is a
tedious process. In the limit as n → ∞ for a fixed length of observation T, the sampling points
become so dense that all the information is extracted. Inversion of $[\mu^{(k)}]$ then involves the solution
of an integral equation. While such infinite-order matrices have sometimes been inverted, such as
those considered by Reich and Swerling (Ref. 4), the $[\mu^{(k)}]$ considered here apparently do not yield a
tractable solution. It has been possible, however, to obtain an approximation to the ideal receiver for
small signal-to-noise ratios which permits considerable simplification of operation.


Let us assume, as before, that the additive gaussian noise has power N and is uniform over the
bandwidth B, and that the sample points are taken 1/B apart. We shall consider the case of general
transmission, and hence deal with $f^{(k)}(t)$ and $g^{(k)}(t)$, obtained from (14). From (10) and (11) it
is easily shown that


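$$p[f_1^{(k)}, \ldots, f_n^{(k)}, g_1^{(k)}, \ldots, g_n^{(k)}/y_{s1}, \ldots, y_{sn}, y_{c1}, \ldots, y_{cn}, x^{(k)}(t)]$$

$$= (2\pi N)^{-n} \exp\left\{ -\frac{1}{2N} \sum_{i=1}^{n} \left[ \left( f_i^{(k)} - r_i^{(k)} y_{si} \right)^2 + \left( g_i^{(k)} - r_i^{(k)} y_{ci} \right)^2 \right] \right\} \qquad (20)$$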


Expanding the squares, we obtain linear and square terms in $y_{si}$ and $y_{ci}$. Analysis of the variances
of these terms shows that the contributions of the square terms are negligible compared to the linear,
providing the signal-to-noise ratio in the band occupied by the signal is small compared to unity.
Thus,


p [an 26), ean al) ay sayy | (21) 


7 Pac Coals 7 2 C1 Ck i) (1) y 
eho EL EPY LP Y)) oe A ra Pa 


where ∝ indicates "approximately proportional to". In order to obtain
$p[f_1^{(k)}, \ldots, f_n^{(k)}, g_1^{(k)}, \ldots, g_n^{(k)}/x^{(k)}(t)]$ from (21), we must average (21) over the
gaussianly-distributed $y_{si}$ and $y_{ci}$. Recognizing the similarity of this exponential average to the
characteristic function, and using the result given by Rice (Ref. 10) for the gaussian case, we obtain




pan of, ean a] g 


(22) 
LG) aa Jae [a (x), (i) 06) | 
at ON Pate ] a 1) vol ey se f; f; * & 85 
(k)_(k) 
ree 8, } 
in the limit as n → ∞,
t T 
» [24,0 ey eee) | x ole § G2) eres Meee) 
0 (6 
(23) 
T T 
B(t,-t,)dt, dt, + ‘ § ere (4, de (ty ye (e,9(ty-t, | 
0 0 


where Δ = 1/B is the sampling interval. Eq. (23) applies equally well to the $f'^{(k)}(t)$ and $g'^{(k)}(t)$
of (15). When $x^{(k)}(t)$ is an amplitude-modulated carrier, a relation similar to (19) yields


p [2 (ey 2) (4) (4) | m 


1 
exp { >> 
iS Ne 

T 
=a) 
6 


where $R^{(k)}(t)$ has the magnitude of $r^{(k)}(t)$ and the sign of $x_s^{(k)}(t)$. Thus for the AM case,
transformation (14) is not necessary.


(24)


Nan | 
Ons 


w(t), (t, R) (+, Rh (t,) p(t, - t,)at, dt, 


ON 


w(t w(t, R™ (4, R™ (4,) p (t, - t,)dt, dt, ; 


Evaluation of the integrals in (23) and (24) could be performed by an analogue device using linear
filters and multipliers. We invoke the fact that if F(s,t) is a symmetric function, that is, F(s,t) =
F(t,s), in the range 0 < t < T and 0 < s < T, then

$$\int_0^T \int_0^T F(s,t)\, ds\, dt = 2 \int_0^T \int_0^s F(s,t)\, dt\, ds \qquad (25)$$




By inspection, the integrals of (23) and (24) satisfy the conditions, for $\phi(t-s) = \phi(s-t)$.
Thus, for example, the first integral of (24) becomes

$$2 \int_0^T w_s(t_1)\, R^{(k)}(t_1) \int_0^{t_1} w_s(t_2)\, R^{(k)}(t_2)\, \phi(t_1 - t_2)\, dt_2\, dt_1$$

The analogue device would multiply $w_s(t)$ by $R^{(k)}(t)$, pass the product through a filter with impulse
response $h(t) = \phi(t)$, t > 0, form a second product using the first and filtered products, and finally
pass the second product into an integrating filter. The output of the integrating filter at time T
is then the value of the integral. A schematic diagram of the operation appears in Fig. 2.
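The multiply, filter, multiply, integrate sequence can be imitated in discrete time to see that it reproduces the double integral over the square. The sketch below is only an illustration. The autocorrelation `phi`, the product waveform `a` (standing for the received component times the stored waveform) and the step count are hypothetical choices, and the diagonal cell is given half weight so that the discretized triangle is exactly half of the discretized square.

```python
import math

def phi(tau):
    """Hypothetical even autocorrelation, chosen only for illustration."""
    return math.exp(-abs(tau))

def a(t):
    """Hypothetical first product: received component times stored waveform."""
    return math.sin(3.0 * t) + 0.5

def square_integral(T, n):
    """Midpoint-rule double integral of a(s) a(t) phi(t - s) over the square [0, T] x [0, T]."""
    h = T / n
    pts = [(i + 0.5) * h for i in range(n)]
    vals = [a(p) for p in pts]
    return h * h * sum(vals[i] * vals[j] * phi(pts[i] - pts[j])
                       for i in range(n) for j in range(n))

def filter_then_integrate(T, n):
    """Filter the product with h(t) = phi(t), t > 0, multiply again, integrate, and double."""
    h = T / n
    pts = [(i + 0.5) * h for i in range(n)]
    vals = [a(p) for p in pts]
    total = 0.0
    for i in range(n):
        # Causal filter output at pts[i]: past samples only, diagonal cell at half weight.
        filtered = h * sum(vals[j] * phi(pts[i] - pts[j]) for j in range(i))
        filtered += 0.5 * h * vals[i] * phi(0.0)
        total += h * vals[i] * filtered        # second product fed to the integrator
    return 2.0 * total

full = square_integral(2.0, 200)
analogue = filter_then_integrate(2.0, 200)
```

The two numbers `full` and `analogue` agree to rounding error, which is the discrete counterpart of identity (25) applied to a symmetric integrand.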


Transformation (14) may also be performed by analogue. Let $x^{(k)}(t)$ be clipped and filtered to
form a new waveform $\hat{x}^{(k)}(t)$ having unit amplitude and preserving the phase modulation of $x^{(k)}(t)$.
Then if the double-carrier-frequency terms are filtered from the product $w(t)\,\hat{x}^{(k)}(t)$, $f^{(k)}(t)$ is
obtained. Similarly, multiplication of w(t) by $\hat{x}^{(k)}(t)$ after the latter has been passed through a
ninety-degree phase-shift network yields $g^{(k)}(t)$. It is not necessary to preserve absolute phase in
the $x^{(k)}(t)$ stored at the receiver, for phase shift merely performs the transformation (15), to which
the probability expressions are invariant.


Conclusions 


A new type of channel disturbance, having a multiplicative nature, has been analyzed, and the
appropriate ideal receiver has been synthesized. It has further been possible to simplify the
receiver to elementary analogue operations, providing the signal-to-noise ratio is sufficiently small.
The success of the analysis rests almost wholly on the elegant properties of the gaussian
distribution function, and it is not expected that generalization to other multiplicative disturbances
could be accomplished. Application of the results of this paper is an open question, but construction of
a laboratory model of the transmitter, channel and receiver might prove interesting.


References 


1. C. E. Shannon and W. Weaver, The Mathematical Theory of Communication, University of Illinois (1949)

2. P. M. Woodward and I. L. Davies, Information Theory and Inverse Probability in Telecommunications,
Proc. IEE, III, p. 37 (March 1952)

3. J. H. Van Vleck and D. Middleton, Jour. Appl. Phys. 17, 940 (1946)

4. E. Reich and P. Swerling, The Detection of a Sine Wave in Gaussian Noise, Jour. Appl. Phys. 24, 289
(1953)

5. H. G. Booker and W. E. Gordon, A Theory of Radio Scattering in the Troposphere, Proc. IRE 38, 401
(1950)

6. H. G. Booker, J. A. Ratcliffe, D. H. Shinn, Diffraction from a Random Screen with Applications to
Ionospheric Problems, Phil. Trans. 242, 579 (1950)

7. S. O. Rice, Statistical Fluctuations of Radio Field Strength Far Beyond the Horizon, Proc. IRE
41, 274 (1953)

8. Lord Rayleigh, Phil. Mag. (6) 37, 321 (1919). Also Scientific Papers 6, 604 (1920)

9. S. O. Rice, Mathematical Analysis of Random Noise, BSTJ, 24, 46-156, Sec. 3.7 (1945)

10. S. O. Rice, Mathematical Analysis of Random Noise, BSTJ, 23, 282-332, Sec. 2.9 (1944)


Fig. 1 - Probability-computing receiver for scatter channel.
(Block labels: input $w_s(t) \sin \omega_0 t + w_c(t) \cos \omega_0 t$; for each stored waveform, a linear
transformation by $x_s^{(k)}(t)$, $x_c^{(k)}(t)$ [not required if x(t) is AM], a matrix operation with
$x_s^{(k)}(t)$, $x_c^{(k)}(t)$, and an exponential; outputs $P[x^{(k)}(t)/w(t)]$.)


Fig. 2 - Element of approximate probability-computing receiver.
(Block labels: input $w_s(t)$; filter $h(t) = \phi(t)$; sampler.)


THE THEORY OF SIGNAL DETECTABILITY *


W. W. Peterson, T. G. Birdsall, and W. C. Fox 
University of Michigan 
Ann Arbor, Michigan 


ABSTRACT 


The problem of signal detectability treated in this paper is the following: Suppose an observer 
is given a voltage varying with time during a prescribed observation interval and is asked to decide 
whether its source is noise or is signal plus noise. What method should the observer use to make this 
decision, and what receiver is a realization of that method? After giving a discussion of theoretical 
aspects of this problem, the paper presents specific derivations of the optimum receiver for a number 
of cases of practical interest. 

The receiver whose output is the value of the likelihood ratio of the input voltage over the
observation interval is the answer to the second question no matter which of the various optimum methods
current in the literature is employed, including the Neyman-Pearson observer, Siegert's ideal observer,
and Woodward and Davies' "observer." An optimum observer required to give a yes or no answer simply
chooses an operating level and concludes that the receiver input arose from signal plus noise only
when this level is exceeded by the output of his likelihood ratio receiver.

Associated with each such operating level are conditional probabilities that the answer is a false 
alarm and the conditional probability of detection. Graphs of these quantities, called receiver 
operating characteristic, or ROC, curves are convenient for evaluating a receiver. If the detection 
problem is changed by varying, for example, the signal power, then a family of ROC curves is generated. 
Such things as betting curves can easily be obtained from such a family. The operating level to be 
used in a particular situation must be chosen by the observer. His choice will depend on such factors 
as the permissible false alarm rate, a priori probabilities, and relative importance of errors.

With these theoretical aspects serving as an introduction, attention is devoted to the derivation 
of explicit formulas for likelihood ratio, and for probability of detection and probability of false 
alarm, for a number of particular cases. Stationary, band-limited, white Gaussian noise is assumed. 
The seven special cases which are presented were chosen from the simplest problems in signal detection 
which closely represent practical situations. 

Two of the cases form a basis for the best available approximation to the important problem of 
finding probability of detection when the starting time of the signal, signal frequency, or both, are 
unknown. Furthermore, in these two cases uncertainty in the signal can be varied, and a quantitative 
relationship between uncertainty and ability to detect signals is presented for these two rather 
general cases. The variety of examples presented should serve to suggest methods for attacking other
simple signal detection problems and to give insight into problems too complicated to allow a direct 
solution. 


1. INTRODUCTION 


The problem of signal detectability treated in this paper is that of determining a set of optimum
instructions to be issued to an "observer" who is given a voltage varying with time during a prescribed
observation interval and who must judge whether its source is "noise" or "signal plus noise." The
nature of the "noise" and of the "signal plus noise" must be known to some extent by the observer.

Any equipment which the observer uses to make this judgement is called the "receiver." Therefore
the voltage with which the observer is presented is called the "receiver input." The optimum
instructions may consist primarily in specifying the "receiver" to be used by the observer.

The first three sections of this article survey the applications of statistical methods to this
problem of signal detectability. They are intended to serve as an introduction to the subject
for those who possess a minimum of mathematical training. Several definitions of "optimum" instructions
have been proposed by other authors. Emphasis is placed here on the fact that these various
definitions lead to essentially the same receiver. In subsequent sections the actual specification of the
optimum receiver is carried out and its performance is evaluated numerically for some cases of
practical interest.


* The work reported in this paper was done under U.S. Army Signal Corps Contract No. DA-36-039
sc-15358.




1.1 Population SN and N 


Either noise alone or the signal plus noise may be capable of producing many different receiver 
inputs. The totality of all possible receiver inputs when noise alone is present is called "Population 
N"; similarly, the collection of all receiver inputs when signal plus noise is present is called
"Population SN." The observer is presented with a receiver input from one of the two populations, but he
does not know from which population it came; indeed, he may not even know the probability that it arose 
from a particular population. The observer must judge from which population the receiver input came. 


1.2 Sampling Plans 

A sampling plan is a system of making a sequence of measurements on the receiver input during 
the observation interval in such a way that it is possible to reconstruct the receiver input for the 
observation interval from the measurements. Mathematically, a sampling plan is a way of representing
functions of time as sequences of numbers. The simplest way to describe this idea is to list a
few examples.


A: Fourier Series on an Interval. Suppose that the observation interval begins at time $t_0$ and is
T seconds long, and that each function in the population SN and N can be expanded in a Fourier series
on the observation interval. The Fourier coefficients for each particular receiver input can be
obtained by making measurements on that input, which can in turn be reconstructed from these measurements
by the formula

$$x(t) = a_0 + \sum_{n=1}^{\infty} \left( a_n \cos \frac{2\pi n t}{T} + b_n \sin \frac{2\pi n t}{T} \right), \qquad t_0 < t < t_0 + T. \qquad (1)$$

Thus the process of representing each function x(t) by the sequence of its Fourier coefficients
$(a_0, a_1, b_1, \ldots, a_n, b_n, \ldots)$ is a sampling plan in the sense described above.

The pair of terms in the Fourier series which involve the cosine and sine of $2\pi n t/T$ is of
frequency n/T cycles per second. Suppose that for a particular population of receiver inputs the terms
of frequency greater than $n_0/T$ are zero; i.e., the population is bandlimited in the Fourier series
sense or simply "series-bandlimited." For such a population the process of representing each receiver
input x(t) by the finite sequence $(a_0, a_1, b_1, \ldots, a_{n_0}, b_{n_0})$ is a finite sampling plan.*
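The Fourier-series sampling plan for a series-bandlimited input can be sketched as follows. This is an illustration under stated assumptions, not a prescription from the paper: the coefficients are estimated from `m` equally spaced measurements over the observation interval, the test input and the names `fourier_coefficients` and `reconstruct` are hypothetical, and the constant term is written $a_0$ as in formula (1).

```python
import math

def fourier_coefficients(x, t0, T, n0, m=256):
    """Estimate a_0 and (a_n, b_n), n = 1..n0, from m equally spaced measurements."""
    ts = [t0 + T * i / m for i in range(m)]
    xs = [x(t) for t in ts]
    a0 = sum(xs) / m
    a = [2.0 / m * sum(v * math.cos(2.0 * math.pi * n * t / T) for v, t in zip(xs, ts))
         for n in range(1, n0 + 1)]
    b = [2.0 / m * sum(v * math.sin(2.0 * math.pi * n * t / T) for v, t in zip(xs, ts))
         for n in range(1, n0 + 1)]
    return a0, a, b

def reconstruct(a0, a, b, T, t):
    """Evaluate the finite Fourier series at time t."""
    return a0 + sum(an * math.cos(2.0 * math.pi * n * t / T)
                    + bn * math.sin(2.0 * math.pi * n * t / T)
                    for n, (an, bn) in enumerate(zip(a, b), start=1))

T, t0, n0 = 2.0, 0.3, 2   # hypothetical observation interval and band limit

def x(t):
    """Hypothetical series-bandlimited receiver input (no terms above n0 / T)."""
    return 1.5 + 2.0 * math.cos(2.0 * math.pi * t / T) - 0.7 * math.sin(4.0 * math.pi * t / T)

a0, a, b = fourier_coefficients(x, t0, T, n0)
error = abs(reconstruct(a0, a, b, T, 1.234) - x(1.234))
```

For this input the finite sequence $(a_0, a_1, b_1, a_2, b_2)$ recovers the waveform to rounding error, so five numbers suffice to represent it on the interval.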


B: Shannon's Sampling Plan. Suppose that the observation interval includes all time and that the
populations are "transform-bandlimited" to a band from 0 to W cycles per second, i.e., the Fourier
transform of every receiver input is zero for frequencies greater than W. A sampling plan for this
population is to represent each function x(t) by its amplitude measured at times spaced 1/2W seconds
apart, $(\ldots, x(t_0 - n/2W), \ldots, x(t_0 - 1/2W), x(t_0), x(t_0 + 1/2W), \ldots, x(t_0 + n/2W), \ldots)$. In this
case the formula for the reconstruction of the receiver input is

$$x(t) = \sum_{n=-\infty}^{\infty} x\!\left(t_0 + \frac{n}{2W}\right) \frac{\sin \pi \left( 2W(t - t_0) - n \right)}{\pi \left( 2W(t - t_0) - n \right)} \qquad (2)$$

The instants of time $t_0 + n/2W$ are called sampling-times. Each choice of $t_0$ between 0 and 1/2W yields
a different sampling plan. If the observation interval again includes all time, but the populations
are transform-bandlimited to a frequency band from $f_0 - W/2$ to $f_0 + W/2$ which does not contain zero
frequency, then each receiver input x(t) can be considered as an amplitude and frequency modulated
waveform, $x(t) = r(t) \cos(2\pi f_0 t + \theta(t))$; r(t) is the amplitude of the envelope and $\theta(t)$ is the
instantaneous phase of the carrier. A sampling plan employing sampling-times is obtained in this
case by representing each receiver input by the sequence
$(\ldots, r(t_0), \theta(t_0), \ldots, r(t_0 + n/W), \theta(t_0 + n/W), \ldots)$ of envelope amplitudes and carrier phases
measured at sampling-times spaced 1/W seconds apart. The reconstruction of the receiver input from this
sequence is given by

$$x(t) = \sum_{n=-\infty}^{\infty} r\!\left(t_0 + \frac{n}{W}\right) \cos\!\left( 2\pi f_0 t + \theta\!\left(t_0 + \frac{n}{W}\right) \right) \frac{\sin \pi \left( W(t - t_0) - n \right)}{\pi \left( W(t - t_0) - n \right)} \qquad (3)$$
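The low-pass reconstruction from amplitudes spaced 1/2W apart can be tried numerically. The sketch below is illustrative and rests on several assumptions: the infinite sum is truncated at `n_max` terms on each side, and the test input is the hypothetical waveform $[\sin(\pi W t)/(\pi W t)]^2$, whose Fourier transform vanishes above W cycles per second and whose samples decay quickly enough for the truncated sum to be accurate.

```python
import math

def sinc(u):
    """sin(pi u) / (pi u), with the limiting value 1 at u = 0."""
    return 1.0 if u == 0.0 else math.sin(math.pi * u) / (math.pi * u)

W = 1.0     # band limit in cycles per second (hypothetical)
t0 = 0.1    # initial sampling-time, chosen between 0 and 1/2W

def x(t):
    """Hypothetical receiver input, transform-bandlimited to the band 0 to W."""
    return sinc(W * t) ** 2

def reconstruct(t, n_max=200):
    """Truncated sum over sampling-times t0 + n/2W, n = -n_max .. n_max."""
    return sum(x(t0 + n / (2.0 * W)) * sinc(2.0 * W * (t - t0) - n)
               for n in range(-n_max, n_max + 1))

t_test = 0.3721   # an instant that is not a sampling-time
error = abs(reconstruct(t_test) - x(t_test))
```

With 401 samples the truncation error at `t_test` is far below one part in ten thousand. A different choice of `t0` gives a different sequence of numbers but the same reconstructed waveform.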


C: Sampling Plan Using Sampling-Times for a Finite Observation Interval. Only functions known
for all times have Fourier transforms, and therefore the hypothesis that the populations are
transform-bandlimited applies only when the observation interval includes all time. If the observation
interval is of finite length and if the populations are series-bandlimited, then there are sampling


* A sampling plan is finite if there is a finite maximum length for the sequences for all receiver 
inputs in the population. 




plans utilizing sampling-times which are similar to those described in paragraph B for transform-band- 
limited populations and an infinite observation interval. Suppose that time is measured from the be- 
ginning of the observation interval, which is T seconds long, and suppose that the populations are 
series-bandlimited from 0 to W cycles per second. A finite sampling plan for this situation can be 

i re by representing each receiver input by the sequence of its amplitudes measured 1/2W seconds 
apart, 


al 
(x(G5)5 x(t, + Bale eoe 9 x(t, +T- =) ) (4) 
and the reconstruction'of the receiver input from this sequence is 
awt - 1 : 
EEE ay x(t, +B) Sat aos: eo (5) 
n=0 OWT sin( po 2 or) 


Again each choice of the (initial) sampling-time $t_0$ between 0 and 1/2W yields a different sampling
plan. In a similar fashion, if the observation interval is unchanged but the populations are
series-bandlimited to a band from $f_0 - W/2$ to $f_0 + W/2$ which does not include zero
frequency, then each receiver input can be represented by a finite sequence
$(r(t_0), \theta(t_0), r(t_0 + 1/W), \theta(t_0 + 1/W), \ldots, r(t_0 + T - 1/W), \theta(t_0 + T - 1/W))$
of envelope amplitudes and carrier phases measured at
sample points 1/W seconds apart; $t_0$ is again used to denote the initial sampling time, which may be
chosen anywhere from 0 to 1/W. The reconstruction of the receiver input from this sequence of
measurements is given by


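$$x(t) = \sum_{n=0}^{WT-1} r\!\left(t_0 + \frac{n}{W}\right) \cos\!\left(2\pi f_0 t + \theta\!\left(t_0 + \frac{n}{W}\right)\right) \frac{\sin \pi\left[W(t - t_0) - n\right]}{WT \sin\!\left(\frac{\pi}{WT}\left[W(t - t_0) - n\right]\right)} \tag{6}$$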


From these examples it can be seen that there are a number of important differences between 
various sampling plans such as i) the length of the observation interval, ii) whether sampling-times 
are employed, and iii) whether the measurements are all to be of the same kind, e.g., instantaneous 
amplitude measurements, or of different kinds, e.g., envelope amplitude and carrier phase. However, 
they all have in common the property that the receiver input can be reconstructed from the measure- 
ments made on it. 
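
The low-pass finite sampling plan of paragraph C can be sketched numerically as follows. This is only an illustration under stated assumptions: the function names, the test signal, and the values of W, T and t0 are hypothetical, and the example picks 2WT odd (here 5), a case in which the periodic kernel reproduces the harmonics 0, 1/T, 2/T exactly. How the band edge is treated when 2WT is even is not settled by the text here, so that case is not exercised.

```python
import numpy as np

def periodic_kernel(v, n_samples):
    """Interpolation kernel sin(pi v) / (N sin(pi v / N)), with N = 2WT samples."""
    v = np.atleast_1d(np.asarray(v, dtype=float))
    num = np.sin(np.pi * v)
    den = n_samples * np.sin(np.pi * v / n_samples)
    at_zero = np.abs(den) < 1e-12
    out = num / np.where(at_zero, 1.0, den)
    # Where the denominator vanishes, use the limiting value (L'Hopital's rule).
    out[at_zero] = np.cos(np.pi * v[at_zero]) / np.cos(np.pi * v[at_zero] / n_samples)
    return out

def reconstruct(samples, t, t0, bandwidth):
    """Rebuild x(t) from amplitudes measured 1/(2W) seconds apart, starting at t0."""
    n = np.arange(len(samples))
    v = 2.0 * bandwidth * (np.asarray(t, dtype=float)[:, None] - t0) - n[None, :]
    return (samples[None, :] * periodic_kernel(v, len(samples))).sum(axis=1)

def signal(t):
    # Hypothetical receiver input containing only the harmonics 0, 1/T and 2/T.
    return 0.5 + np.cos(2 * np.pi * t) - 0.7 * np.sin(2 * np.pi * 2 * t)

W, T, t0 = 2.5, 1.0, 0.03            # assumed values; t0 lies between 0 and 1/(2W)
N = int(round(2 * W * T))            # length of the finite sequence
sample_times = t0 + np.arange(N) / (2 * W)
samples = signal(sample_times)

t = np.linspace(0.0, T, 50)
max_error = np.max(np.abs(reconstruct(samples, t, t0, W) - signal(t)))
print(f"largest reconstruction error: {max_error:.2e}")
```

Changing t0 within its allowed range gives a different sequence of five numbers but the same reconstructed input, which is the sense in which each choice of sampling-time yields a different sampling plan.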

The role which the sampling plan plays in the theory presented in this paper is primarily one of 
mathematical convenience. The populations N and SN will be represented as sequences through the use 
of sampling plans in order to apply statistical methods. Once an answer is obtained concerning an 
"optimum" receiver, it is often possible to translate this answer back to the more familiar language 
of receiver inputs. If a finite sampling plan is not available for a particular application of the
theory, then recent work by Grenander? shows that the desired parameters of the "optimum" receiver 
can be approximated by using finite sampling plans. Both for this reason and in order to simplify 
the exposition, the theory presented here is restricted to cases where finite sampling plans are 
available. 


2. OPTIMUM TESTS ON FIXED OBSERVATION INTERVALS 


2.1 Probability Density Functions 


This part of the paper is concerned with a method of statistical analysis which requires for raw
data a finite sequence of numbers $(x_1, x_2, \ldots, x_n)$ which is the result of the measurements made at
the receiver input according to some particular finite sampling plan. The sequence is often called a
"sample" of the population from which it arose, and is denoted by a single letter; thus, if the
receiver input is x(t), and the sampling plan yields a sequence $(x_1, x_2, \ldots, x_n)$, then this sequence
is called the sample X. The theory to be developed here is intended to specify an optimum receiver
and is couched in the language of samples, $X = (x_1, x_2, \ldots, x_n)$. If n is very large, a receiver
which had to make the measurements called for by a sampling plan would certainly be impractical.
However, this practical difficulty is avoided when the specification of the receiver is translated back
from the language of samples to the language of the receiver inputs; this can be done because it is
possible to reconstruct the inputs from the samples.

For the purposes of the subsequent development any finite sampling plan may be considered provided 




enough properties are known of the associated sample X so that certain probabilities may be calculated.
Specifically, the probability density functions $f_N(X)$ and $f_{SN}(X)$ of the sample variable X for the cases
when X is drawn from populations N and SN respectively must be known.* The two basic properties of
density functions are

$$f_N(X) \ge 0, \qquad \int f_N(X)\, dX = 1,$$

and

$$f_{SN}(X) \ge 0, \qquad \int f_{SN}(X)\, dX = 1, \tag{7}$$

where the integration symbol represents the multiple integral taken over the entire range of the sample
variable $X = (x_1, x_2, \ldots, x_n)$.


2.2 The Concept of a Criterion


Consider now an observer who has as available data the sample $X = (x_1, x_2, \ldots, x_n)$. The observer's
job is to judge for each sample whether or not it was taken from population SN. Although it is not
possible to determine the (probably subconscious) criterion used by the observer, it is quite possible 
to find an external manifestation of it. Ideally all that is necessary is to submit each possible 
sample to the observer and to record his judgement. This will yield a tabulation of those samples 
which the observer decided were drawn from population SN. If any other observer is given this tabula- 
tion and instructed to base his decisions on it, he will behave exactly as did the first observer. 
Thus, the tabulation of these responses can be used to replace the mental criterion employed by the 
observer. Such a tabulation will also be called a criterion and will be denoted by the letter A, 
which refers to the phraseology common in statistics of "Accepting the hypothesis that a signal is 
present." The tabulation of the remaining samples, those which the observer concluded were drawn from 
population N, will be denoted by B. 


2.3 Probabilities Associated with Criteria


There are of course as many different criteria as there are observers. Among all possible cri- 
teria it is necessary to select those that are best for various purposes. To do so, certain numerical 
quantities must be associated with each criterion. It will be necessary to know the probability that
a sample from one of the populations will be listed in a particular criterion A. According to the
standard definitions, these probabilities are given by

$$P_{SN}(A) = \int_A f_{SN}(X)\, dX$$

and

$$P_N(A) = \int_A f_N(X)\, dX, \tag{8}$$

where the multiple integral is taken over all samples listed in the criterion A.

For example, a particular sampling plan might have a density function of the form
$f_N(x_1, x_2, \ldots, x_n) = K \exp[-(x_1^2 + x_2^2 + \cdots + x_n^2)]$. A possible criterion would consist of those samples
$X = (x_1, x_2, \ldots, x_n)$ which lie outside a sphere of radius one centered at the origin. Then the
integral would be taken over the exterior of this sphere.
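
The sphere example can be checked numerically. The sketch below is a Monte Carlo estimate, not part of the original development; the function name and the choice n = 2 are assumptions. It uses the fact that a density proportional to exp(-x^2) in each coordinate is Gaussian with variance 1/2, so that for n = 2 the probability of falling outside the unit sphere is exp(-1).

```python
import math
import random

def prob_outside_sphere(n_dims, radius, n_trials, seed=0):
    """Estimate P_N(A) when A lists the samples outside a sphere about the origin."""
    rng = random.Random(seed)
    sigma = math.sqrt(0.5)   # exp(-x^2) corresponds to a Gaussian of variance 1/2
    count = 0
    for _ in range(n_trials):
        r_squared = sum(rng.gauss(0.0, sigma) ** 2 for _ in range(n_dims))
        if r_squared > radius ** 2:
            count += 1       # this sample is listed in the criterion A
    return count / n_trials

estimate = prob_outside_sphere(n_dims=2, radius=1.0, n_trials=200_000)
exact = math.exp(-1.0)       # closed form for n = 2 only
print(f"Monte Carlo estimate {estimate:.4f}, closed form {exact:.4f}")
```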

These probabilities have a special significance. $P_N(A)$ is the conditional probability that a
sample from population N will be listed in criterion A, that is, will be judged as a sample from
population SN. Thus $P_N(A) = F$ is the conditional false alarm probability. Also, $P_{SN}(A)$ is the
conditional probability of a certain kind of correct response called a hit (that of judging correctly
that a sample is from population SN). The conditional probability of judging falsely that a sample is
from population N is therefore given by $1 - P_{SN}(A) = M$, the conditional probability of a miss. The only
errors which can occur are false alarms and misses; their conditional probabilities, F and M, are
called briefly the error probabilities.

A reader familiar with the formal content of probability theory should note that these quantities 


* In this discussion it should be kept in mind that "the event of the sample being drawn from popula- 
tion SN" corresponds to signal and noise being present at the receiver input. Also "the event of pop- 
ulation SN being sampled" means the same thing. 




are true conditional probabilities; the first is conditional on the sample being drawn from population 
SN; the second is conditional on its being drawn from population N. This is to distinguish them from 
a priori probabilities (the probabilities that a certain population will be sampled, for example) 
which are not as yet assumed known.


2.4 Likelihood Ratio and the Ratio Criteria


It is convenient to introduce a new function called the likelihood ratio, $\ell(X)$, defined as the
ratio $f_{SN}(X)/f_N(X)$ for sample points $X = (x_1, \ldots, x_n)$; $\ell(X)$ represents the likelihood that the
sample X was drawn from SN relative to the likelihood that it was drawn from N. Hence, if $\ell(X)$ is
sufficiently large, it would be reasonable to conclude that X was in fact drawn from population SN,
i.e., that X should be listed in the desired "best" criterion. Thus, for each number $\beta \ge 0$, a certain
criterion $A(\beta)$ will be selected; $A(\beta)$ is chosen by listing each sample X for which $\ell(X) \ge \beta$. The
problem then reduces to that of making a wise choice of $\beta$; that is, to determine how large
"sufficiently large" is. Criteria of the form $A(\beta)$ will be called ratio criteria.

A number of writers have presented varying definitions of a criterion being "optimum." It turns 
out that each of these optimum criteria can be expressed as a ratio criterion, so that a receiver
designed to yield likelihood ratio as output could be used with any of them.


2.5 Weighted Combination Criteria 


Suppose it is possible to assign a certain number w as a weighting factor representing the importance
of a false alarm relative to a hit. Since $P_{SN}(A)$ is the probability of a hit, and $P_N(A)$ the
probability of a false alarm, it would then be reasonable to find a criterion A which maximizes the quantity

$$P_{SN}(A) - w P_N(A). \tag{9}$$

But this quantity can be written as

$$\int_A \left[ f_{SN}(X) - w f_N(X) \right] dX, \tag{10}$$

where the integration is taken over the sample points X listed in A. To maximize this integral, one
would list in A every sample for which the integrand was not negative. Solving that inequality for
w, one sees that A should contain those sample points X for which

$$\ell(X) = \frac{f_{SN}(X)}{f_N(X)} \ge w. \tag{11}$$

Thus the desired criterion A is simply A(w), and so it is a ratio criterion.
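
A small numerical check of this result is sketched below. The one-measurement Gaussian model (noise alone distributed Normal(0, 1), signal plus noise distributed Normal(d, 1)) is an assumption made only for illustration, as are the function names and the values of d and w. Among the ratio criteria on a grid of operating levels, the weighted combination is largest where the level equals w.

```python
import math

def q(z):
    """Upper tail probability of a standard normal variable."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))

def ratio_criterion_probs(beta, d):
    """Return (P_N(A(beta)), P_SN(A(beta))) for the assumed Gaussian model."""
    # Likelihood ratio exp(d*x - d^2/2) >= beta  is the same as  x >= ln(beta)/d + d/2.
    cut = math.log(beta) / d + d / 2.0
    return q(cut), q(cut - d)

def weighted_combination(beta, w, d):
    p_n, p_sn = ratio_criterion_probs(beta, d)
    return p_sn - w * p_n

d, w = 1.5, 2.0                                  # hypothetical values
betas = [0.05 * k for k in range(1, 201)]        # operating levels 0.05, 0.10, ..., 10.0
best_beta = max(betas, key=lambda b: weighted_combination(b, w, d))
print(f"weighting factor w = {w}, best operating level on the grid = {best_beta:.2f}")
```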


2.6 Neyman-Pearson Criteria 


If it is critically important to keep the probability of a false alarm $P_N(A)$ below a certain level
k, then it would be reasonable to choose from among such criteria that one which maximizes the
probability of a hit. Thus Neyman and Pearson proposed*+ as a type of optimum criterion any criterion $A_k$
for which

(1) $P_N(A_k) \le k$, and
(2) $P_{SN}(A_k)$ is a maximum for all the criteria A with the property $P_N(A) \le k$.


The $A_k$ type criterion can also be expressed as a ratio criterion. This can be made plausible as
follows. To begin with, it is necessary to consider only those criteria A for which $P_N(A) = k$,
because A will be taken as large as possible in order to meet condition (2). Now consider the curve
given parametrically by the equations

$$X = X(\beta) = P_N(A(\beta))$$

and

$$Y = Y(\beta) = P_{SN}(A(\beta)). \tag{12}$$


This curve will be called the Receiver Operating Characteristic (briefly, ROC) curve, for a receiver 
whose output is likelihood ratio and with which ratio criteria are being used. 

The ROC curve passes through the points (0, 0) and (1, 1), the first at $\beta = \infty$, the second at
$\beta = 0$. At $\beta = 0$, $\ell(X) \ge \beta = 0$ for all X, so A(0) consists of all possible samples. Thus the
observer will report that every sample is drawn from SN, so he will be certain to make a false alarm and to
make a hit. (This assumes that the samples will not be drawn exclusively from one of the populations.)
This can be verified, using the basic property of the density functions expressed by the following
equations:

$$P_{SN}(A(0)) = \int f_{SN}(X)\, dX = 1$$

and

$$P_N(A(0)) = \int f_N(X)\, dX = 1, \tag{13}$$

where the integration is taken over all possible samples X. These equations mean that X(0) = Y(0) = 1.
Moreover, $X(\infty) = Y(\infty) = 0$, because for $\beta = \infty$ there are no samples X with $\ell(X) \ge \beta$; i.e., $A(\infty)$
contains no samples at all and the operator will never report a signal is present. Therefore the operator
cannot possibly make a false alarm nor can he make a hit. Thus $P_{SN}(A(\infty)) = 0$ and $P_N(A(\infty)) = 0$.

These considerations, together with those of the next section, show that the ROC curve can be 
sketched somewhat as in Fig. 1. 


[Figure: ROC curve with the point Q = (X, Y) marked on it; horizontal axis $X(\beta) = P_N(A(\beta))$.]


FIG. 1. TYPICAL ROC CURVE 


To determine the desired $A_k$, recall that all probabilities lie between zero and one, so that
$P_N(A_k) = k$ is between zero and one. Then there is a point Q of the ROC curve which lies vertically
above the point (k, 0). The coordinates (X, Y) of Q are $X = P_N(A(\beta)) = k$ and $Y = P_{SN}(A(\beta))$, for some $\beta$,
which will be written $\beta_k$. Now $A(\beta_k)$ satisfies condition (1) because $P_N(A(\beta_k)) = k$, and therefore $A(\beta_k)$
will be the desired $A_k$ if $P_{SN}(A) \le P_{SN}(A(\beta_k))$ for any criterion A with the property that $P_N(A) = k$. From
paragraph 2.5, it is clear that the ratio criterion $A(\beta_k)$ is an optimum weighted-combination criterion
with the weighting factor $w = \beta_k$. Therefore, if $w = \beta_k$, the weighted combination using the criterion
$A(\beta_k)$ is greater than or equal to the same weighted combination using any other criterion A, i.e.,

$$P_{SN}(A(\beta_k)) - \beta_k P_N(A(\beta_k)) \ge P_{SN}(A) - \beta_k P_N(A). \tag{14}$$

In this case both $P_N(A(\beta_k))$ and $P_N(A)$ are equal to k. If this value is substituted into the inequality
above, one obtains

$$P_{SN}(A(\beta_k)) \ge P_{SN}(A). \tag{15}$$

Therefore, the desired Neyman-Pearson criterion $A_k$ should be chosen to be this particular ratio
criterion, $A(\beta_k)$.
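
The Neyman-Pearson construction can be illustrated with a short sketch. The single-measurement Gaussian model (noise alone Normal(0, 1), signal plus noise Normal(d, 1)), the function names, and the values k = 0.05 and d = 2 are assumptions introduced here. The two-sided criterion is included only as one example of a competing criterion with the same false alarm probability.

```python
import math
from statistics import NormalDist

STD = NormalDist()  # standard normal distribution

def neyman_pearson(k, d):
    """Return (beta_k, hit probability) of the ratio criterion with P_N(A) = k."""
    cut = STD.inv_cdf(1.0 - k)                  # P_N(x >= cut) = k
    beta_k = math.exp(d * cut - d * d / 2.0)    # likelihood ratio at the cut
    hit = 1.0 - STD.cdf(cut - d)                # P_SN(x >= cut)
    return beta_k, hit

def two_sided_hit(k, d):
    """Hit probability of the competing criterion |x| >= c, also with P_N(A) = k."""
    c = STD.inv_cdf(1.0 - k / 2.0)
    return (1.0 - STD.cdf(c - d)) + STD.cdf(-c - d)

k, d = 0.05, 2.0                                # hypothetical values
beta_k, hit = neyman_pearson(k, d)
competitor_hit = two_sided_hit(k, d)
print(f"beta_k = {beta_k:.3f}, ratio criterion hit = {hit:.4f}, two-sided hit = {competitor_hit:.4f}")
```

Under these assumptions the ratio criterion gives the larger hit probability, as inequality (15) requires for any competitor with the same false alarm probability.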


2.7 ROC Curve


It is desirable to digress for a moment to study the ROC curve more closely. Its value lies
in the fact that if the type of criterion chosen for a particular application is a ratio criterion,
$A(\beta)$, then a complete description of the detection system's performance can be read off the ROC curve.
By the very definition of the ROC curve, the X coordinate is the conditional probability, F, of false
alarm, and the Y coordinate is the conditional probability of a hit. Similarly (1-X) is the
conditional probability of being correct when noise alone is present, and (1-Y) = M is the conditional
probability of a miss. It will be shown in a moment that the operating level $\beta$ for the ratio criterion $A(\beta)$
can also be determined from the ROC curve as the slope at the point

$$\left( P_N(A(\beta)),\; P_{SN}(A(\beta)) \right).$$

Since most proposed kinds of optimum criteria can be reduced to ratio criteria, the ROC curve assumes
considerable importance.


In order to determine some of its geometric properties, it will be assumed that the parametric
functions

$$X = X(\beta) = P_N(A(\beta))$$

and

$$Y = Y(\beta) = P_{SN}(A(\beta)) \tag{16}$$

are differentiable functions of $\beta$. The slope of the tangent to the ROC curve is given by the quotient
$(dY/d\beta)/(dX/d\beta)$. To calculate the slope at the point $(X(\beta_0), Y(\beta_0))$, notice that among all criteria A,
the quantity $P_{SN}(A) - \beta_0 P_N(A)$ is maximized by $A = A(\beta_0)$. Therefore, in particular, the function

$$Y(\beta) - \beta_0 X(\beta) = P_{SN}(A(\beta)) - \beta_0 P_N(A(\beta)) \tag{17}$$

has a maximum at $\beta = \beta_0$, so that its derivative must vanish there. Thus differentiating,

$$\frac{dY}{d\beta} - \beta_0 \frac{dX}{d\beta} = 0 \quad \text{at } \beta = \beta_0. \tag{18}$$

Solving for $\beta_0$, one obtains

$$\beta_0 = \left. \frac{dY/d\beta}{dX/d\beta} \right|_{\beta = \beta_0} = \text{the slope of the tangent to the ROC curve at the point } (X(\beta_0), Y(\beta_0)). \tag{19}$$

This shows that the slope of the ROC curve is given by its parameter $\beta$, and so is always positive.
Hence the curve rises steadily. In addition, this means that $Y(\beta)$ can be written as a single valued
function of $X(\beta)$, Y = Y(X), which is monotone increasing, and where Y(0) = 0 and Y(1) = 1. These remarks
make fully warranted the sketch of the ROC curve given in Fig. 1. The next two sections are concerned
with determining the best value to use for the weighting factor w when a priori probabilities are known.
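
The statement that the slope of the ROC curve equals the operating level can be checked numerically. The sketch below again assumes, purely for illustration, one Gaussian measurement (noise alone Normal(0, 1), signal plus noise Normal(d, 1)); the function names and the step size are hypothetical. It estimates the slope from two nearby points of the curve.

```python
import math

def q(z):
    """Upper tail probability of a standard normal variable."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))

def roc_point(beta, d):
    """(X, Y) = (P_N(A(beta)), P_SN(A(beta))) for the assumed Gaussian model."""
    cut = math.log(beta) / d + d / 2.0   # likelihood ratio >= beta  <=>  x >= cut
    return q(cut), q(cut - d)

def roc_slope(beta, d, rel_step=1e-4):
    """Chord slope dY/dX between two operating levels close to beta."""
    x_lo, y_lo = roc_point(beta * (1.0 - rel_step), d)
    x_hi, y_hi = roc_point(beta * (1.0 + rel_step), d)
    return (y_hi - y_lo) / (x_hi - x_lo)

slopes = {beta: roc_slope(beta, d=1.0) for beta in (0.5, 1.0, 2.0)}
for beta, slope in slopes.items():
    print(f"operating level {beta}: estimated slope {slope:.5f}")
```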


2.8 Siegert's "Ideal Observer's" Criteria 


Here it is necessary to know beforehand the a priori probabilities that population SN and that 
population N will be sampled. This is an additional assumption. These probabilities are denoted 
respectively by P(SN) and P(N). Moreover, P(SN) + P(N) = 1 because at least one of the populations 
must be sampled. The criterion associated with Siegert's Ideal Observer is usually defined as a 
criterion for which a priori probability of error is minimized (or, equivalently, the a priori proba- 
bility of a correct response is maximized). Frequently the only case considered is that where P(SN) 
and P(N) are equal, but this restriction is not necessary. 

Since the conditional probability F of a false alarm is known as well as the a priori probability 
of the event (that population N was sampled) upon which F is conditional, then the probability of a 
false alarm is given by the product 


P(N)F . (20) 
In the same way the probability of a miss is given by 
P(SN)M . (21) 




Because an error E can occur in exactly these two ways, the probability of error is the sum of these 
quantities 


P(E) = P(N)F + P(SN)M . (22) 


It has already been pointed out that $F = P_N(A)$ and $M = 1 - P_{SN}(A)$. If these are substituted into
the expression for P(E) a simple algebraic manipulation gives

$$P(E) = P(SN) - P(SN) \left[ P_{SN}(A) - \frac{P(N)}{P(SN)}\, P_N(A) \right]. \tag{23}$$

It is desired to minimize P(E). But from the last equation this is equivalent to maximizing the
quantity

$$P_{SN}(A) - \frac{P(N)}{P(SN)}\, P_N(A) \tag{24}$$

and, of course, this will yield a weighted combination criterion with w = P(N)/P(SN), which is known to
be simply a ratio criterion A(w).
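
As a numerical illustration, the sketch below evaluates P(E) = P(N)F + P(SN)M over a grid of ratio criteria and compares the minimizing operating level with P(N)/P(SN). The Gaussian model (noise alone Normal(0, 1), signal plus noise Normal(d, 1)), the a priori probability 0.25, and all names are assumptions made for the example.

```python
import math

def q(z):
    """Upper tail probability of a standard normal variable."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))

def error_probability(beta, p_sn, d):
    """P(E) = P(N) F + P(SN) M for the ratio criterion A(beta)."""
    cut = math.log(beta) / d + d / 2.0   # likelihood ratio >= beta  <=>  x >= cut
    false_alarm = q(cut)                 # F = P_N(A(beta))
    miss = 1.0 - q(cut - d)              # M = 1 - P_SN(A(beta))
    return (1.0 - p_sn) * false_alarm + p_sn * miss

p_sn, d = 0.25, 1.5                      # hypothetical values
betas = [0.05 * k for k in range(1, 201)]
best_beta = min(betas, key=lambda b: error_probability(b, p_sn, d))
ideal_weight = (1.0 - p_sn) / p_sn       # w = P(N)/P(SN)
print(f"P(N)/P(SN) = {ideal_weight:.2f}, minimizing level on the grid = {best_beta:.2f}")
```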


2.9 Maximum Expected-Value Criteria 


Another way to assign a weighting factor w depends on knowing the "expected value" of each 
criterion. This can be determined if the a priori probabilities P(SN) and P(N) are known, and if 
numerical values can be assigned to the four alternatives. Let $V_D$ be the value of detection and $V_Q$
the value of being "quiet", that is, of correctly deciding that noise alone is present. The other
two alternatives are also assigned values, $V_M$, the value of a miss, and $V_F$, the value of a false alarm.
The expected value associated with a criterion can now be determined. In this case it is natural to
define an optimum criterion as one which maximizes the expected value. It can be shown that such a
criterion maximizes

$$P_{SN}(A) - \left[ \frac{P(N)}{P(SN)} \cdot \frac{V_Q - V_F}{V_D - V_M} \right] P_N(A). \tag{25}$$

By definition (see paragraph 2.5), this criterion is a weighted combination criterion with weighting
factor

$$w = \frac{P(N)}{P(SN)} \cdot \frac{V_Q - V_F}{V_D - V_M} \tag{26}$$

and hence a likelihood ratio criterion. Siegert's "Ideal Observer" criterion is the special case for
which $V_Q - V_F = V_D - V_M$.
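
The step described as "it can be shown" can be made concrete with a few lines of arithmetic. In the sketch below the function names and all numerical values are hypothetical. It computes the expected value of a criterion from its hit and false alarm probabilities, and checks that the difference in expected value between two criteria is a positive multiple of the difference in their weighted combinations (25).

```python
def expected_value_weight(p_sn, v_detect, v_miss, v_quiet, v_false_alarm):
    """Weighting factor w of equation (26)."""
    p_n = 1.0 - p_sn
    return (p_n / p_sn) * (v_quiet - v_false_alarm) / (v_detect - v_miss)

def expected_value(p_sn, hit, false_alarm, v_detect, v_miss, v_quiet, v_false_alarm):
    """Expected value of a criterion with the given hit and false alarm probabilities."""
    p_n = 1.0 - p_sn
    return (p_sn * (hit * v_detect + (1.0 - hit) * v_miss)
            + p_n * (false_alarm * v_false_alarm + (1.0 - false_alarm) * v_quiet))

# Hypothetical a priori probability and values of the four alternatives.
p_sn, v_d, v_m, v_q, v_f = 0.3, 5.0, -2.0, 1.0, -3.0
w = expected_value_weight(p_sn, v_d, v_m, v_q, v_f)

# Two hypothetical criteria described by (hit, false alarm) probabilities.
hit_1, fa_1 = 0.9, 0.2
hit_2, fa_2 = 0.6, 0.05
ev_difference = (expected_value(p_sn, hit_1, fa_1, v_d, v_m, v_q, v_f)
                 - expected_value(p_sn, hit_2, fa_2, v_d, v_m, v_q, v_f))
combination_difference = (hit_1 - w * fa_1) - (hit_2 - w * fa_2)
scale = p_sn * (v_d - v_m)   # positive whenever detection is worth more than a miss

# With equal value differences the weight reduces to P(N)/P(SN).
ideal_observer_weight = expected_value_weight(0.2, 1.0, 0.0, 1.0, 0.0)
print(w, ev_difference, scale * combination_difference, ideal_observer_weight)
```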


2.10 A Posteriori Probability and Signal Detectability 


Heretofore the observer has been limited to two possible answers, "signal plus noise is present"
or "noise alone is present". Instead he may be asked what, to the best of his knowledge, is the
probability that a signal is present. This approach has the advantage of getting more information from the
receiving equipment. In fact, Woodward and Davies point out that if the observer makes the best
possible estimate of this probability for each possible transmitted message, he is supplying all the
information which his equipment can give him. A good discussion of this approach is found in the
original papers by Woodward and Davies.°»? Their formula for the a posteriori probability, $P_X(SN)$, becomes,
in the notation of this paper,

$$P_X(SN) = \frac{f_{SN}(X)\, P(SN)}{f_{SN}(X)\, P(SN) + (1 - P(SN))\, f_N(X)} \tag{27}$$

or

$$P_X(SN) = \frac{\ell(X)\, P(SN)}{\ell(X)\, P(SN) + 1 - P(SN)}. \tag{28}$$

If a receiver which has likelihood ratio as its output can be built, and if the a priori probability
P(SN) is known, a posteriori probability can be calculated easily. The calculation could be built into
the receiver calibration, since (28) is a monotonic function of $\ell(X)$; this would make the receiver an
optimum receiver for obtaining a posteriori probability.
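
Equation (28) is simple enough to state as a short function. The sketch below uses hypothetical names and an assumed a priori probability of 0.3; it also checks the monotonic dependence on likelihood ratio that makes a calibration of the receiver output possible.

```python
def posterior_signal_probability(likelihood_ratio, p_sn):
    """A posteriori probability of signal plus noise, following equation (28)."""
    return likelihood_ratio * p_sn / (likelihood_ratio * p_sn + 1.0 - p_sn)

p_sn = 0.3                                         # assumed a priori probability P(SN)
ratios = [0.0, 0.1, 0.5, 1.0, 2.0, 10.0, 100.0]    # sample likelihood ratio outputs
posteriors = [posterior_signal_probability(r, p_sn) for r in ratios]
for r, p in zip(ratios, posteriors):
    print(f"likelihood ratio {r:6.1f} -> a posteriori probability {p:.4f}")
```

A likelihood ratio of one leaves the a priori probability unchanged, and larger ratios push the a posteriori probability toward one.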




3. SEQUENTIAL TESTS WITH MINIMUM AVERAGE DURATION 
3.1 Sequential Testing 


The idea of sequential testing is this: make one measurement $x_1$ on the receiver input; if the
evidence $x_1$ is sufficiently persuasive, decide as to whether the receiver input was drawn from
population SN or from population N. If the evidence is not so strong, make a second measurement $x_2$ and
consider the evidence $(x_1, x_2)$. Continue to make measurements until the resulting sequence of measurements
is sufficiently persuasive in favor of one population or the other. Obviously this involves the
theoretical possibility of making arbitrarily many measurements before a final decision is made. This does not
mean that infinitely many measurements must be made in an actual application, nor does it necessarily
mean that the operation might entail an arbitrarily long interval of time. If in a particular
application measurements are taken at evenly spaced times, then the "time base" of such a measurement plan is
infinite. However, another plan might call for measurements to be made at the instants t = 0, t = 1/2,
..., t = (n-1)/n, ... and as these times all lie in the time interval from zero to one, such a
measurement plan would have a time base of only one unit of time.

If the measurement plan has been carried out to the stage where n measurements $x_1, x_2, \ldots, x_n$
have been made, the variable $X_n = (x_1, x_2, \ldots, x_n)$ is called the nth stage sample variable. A
specific plan for measurements will be considered only if for each possible stage n, the two density
functions $f_{SN}(X_n)$ and $f_N(X_n)$ of the nth stage sample variable $X_n$ are known; the first of these density
functions is applicable when population SN is being sampled and the second is applicable when
population N is being sampled. These density functions may very well differ at different stages, so that
they should be written $f_N^{(n)}(X_n)$ and $f_{SN}^{(n)}(X_n)$; however, the n appearing in the argument $X_n$ should always
make the situation clear, and the superscript on the density functions themselves will be omitted.


3.2 Sequential Tests
A sequential test will consist of two things: 


1) An (infinite) measurement plan with density functions $f_N(X_n)$ and $f_{SN}(X_n)$
2) An assignment of three criteria to each stage of the measurement plan. 


These three criteria represent the three possible conclusions: 


A) Signal plus noise is present, i.e. the sample comes from population SN 
B) Noise alone is present, i.e. the sample comes from population N 
C) Another measurement should be made. 


At the first stage of the measurement plan, any (real) number at all could theoretically result
from the first measurement. This means that the first stage sample variable $X_1 = (x_1)$ ranges through
the entire number system, which will be written $S_1$ to stand for the first stage sample space. Suppose
the three first-stage criteria $A_1$, $B_1$, and $C_1$ have been chosen. If the sample $X_1$ is listed in $A_1$, the
conclusion that a signal is present is drawn and the test terminated. If it is listed in $B_1$, the
conclusion is that noise alone is present, and again the test is terminated. If $X_1$ should be listed in $C_1$,
another measurement will be made, and the test moves on to the second stage instead of terminating.

When the first stage criteria have been chosen, a limitation is placed on $S_2$, the space through
which the second stage sample variable $X_2 = (x_1, x_2)$ ranges. The only way the test can proceed to the
second stage is for $X_1 = (x_1)$ to be listed in $C_1$. Therefore, $S_2$ does not contain all possible second
stage samples $X_2 = (x_1, x_2)$ but only those for which $(x_1)$ is listed in $C_1$. Three second stage criteria,
$A_2$, $B_2$, and $C_2$, must now be chosen from those samples $X_2$ listed in $S_2$. They must be chosen in such a
way that there are no duplications in the listings and no sample in $S_2$ is omitted. These criteria
carry exactly the same significance as those chosen in the first stage. That is, the three conclusions
that a signal is or is not present, or that the test should be continued, are drawn when the sample $X_2$
is listed in $A_2$, $B_2$, or $C_2$ respectively.

The selection of criteria proceeds in the same way. If the nth stage criteria $A_n$, $B_n$, and $C_n$
have been chosen, then the next stage's sample space $S_{n+1}$ consists of those samples
$X_{n+1} = (x_1, x_2, \ldots, x_n, x_{n+1})$ for which $X_n = (x_1, x_2, \ldots, x_n)$ was listed in $C_n$. Then from $S_{n+1}$ are drawn
the three (n+1)th stage criteria $A_{n+1}$, $B_{n+1}$, and $C_{n+1}$.




When an entire sequence of criteria is selected, a "sequential test" has been determined. This does
not mean of course that the test will necessarily be particularly useful. However, among all the
possible ways of selecting a sequence of criteria and hence a sequential test, there may be particular
ones which are very useful.


3.3 Probabilities Associated with Sequential Tests

If $Q_n$ is any nth stage criterion, then the quantities*

$$P_N(Q_n) = \int_{Q_n} f_N(X_n)\, dX_n$$

and

$$P_{SN}(Q_n) = \int_{Q_n} f_{SN}(X_n)\, dX_n \tag{29}$$

represent the (N or SN) conditional probabilities that an nth stage sample $X_n$ will be listed in the
criterion $Q_n$. Conditional probabilities of particular interest are:


1) The nth stage conditional error probabilities:

If population N is sampled, then the probability that the sample variable $X_n$ will be listed in $A_n$
is $P_N(A_n)$. This is the N-conditional probability of a false alarm.

If population SN is sampled, then the probability that the sample variable $X_n$ will be listed
in $B_n$ is $P_{SN}(B_n)$. This is the SN-conditional probability of a miss.


2) The conditional error probabilities of the entire test:

$$F = \sum_{n=1}^{\infty} P_N(A_n), \text{ the N-conditional probability of a false alarm, and} \tag{30}$$

$$M = \sum_{n=1}^{\infty} P_{SN}(B_n), \text{ the SN-conditional probability of a miss,} \tag{31}$$

are merely the sums of the same error probabilities over all stages.


3) The conditional probabilities of terminating at stage n are

$$T_N^n = P_N(A_n) + P_N(B_n), \text{ and} \tag{32}$$

$$T_{SN}^n = P_{SN}(A_n) + P_{SN}(B_n). \tag{33}$$

These equations can be justified by a simple argument. The only way the test can terminate at
stage n is for the sample variable $X_n$ to be listed in either $A_n$ or $B_n$. The probability of this event
is the sum of the probabilities of the component events, which are mutually exclusive since $X_n$ can be
listed in at most one of $A_n$ and $B_n$.


* The notation $\int_{Q_n}$ indicates that the integration is to be carried out over all sample points listed
in $Q_n$.




4) The conditional probabilities that the entire test will terminate are

$$T_N = \sum_{n=1}^{\infty} T_N^n, \text{ and} \tag{34}$$

$$T_{SN} = \sum_{n=1}^{\infty} T_{SN}^n. \tag{35}$$


3.4. Average Sample Numbers 


There are two other quantities which must be introduced. One feature of the sequential test is
that it affords an opportunity of arriving at a decision early in the sampling process when the data
happens to be unusually convincing. Thus one might expect that, on the average, the stage of termination
of a well-constructed sequential test would be lower than could be achieved by an otherwise equally good
standard test. It is therefore important to obtain expressions for the average or expected value of
the stage of termination. As with other probabilities, there will be two of these quantities: one
conditional on population N being sampled; the other conditional on population SN being sampled. They
are given by

$$E_N = \sum_{n=1}^{\infty} n\, T_N^n \tag{36}$$

and

$$E_{SN} = \sum_{n=1}^{\infty} n\, T_{SN}^n. \tag{37}$$

The letter E is used to refer to the term "expected value." The quantities $E_N$ and $E_{SN}$ are called the
average sample numbers. The form these formulas take can be justified (somewhat freely) on the grounds
that each value, n, which the variable "stage of termination" may take on must be weighted by the
(conditional) probability that the variable will in fact take on that value.

It should be heavily emphasized that the average sample numbers are strictly average figures. In
actual runs of a sequential test, the stages of termination will sometimes be less than the average
sample numbers but will also be upon occasion much larger. Any sequential test whose average sample
numbers are not finite would be useless for applications. Therefore the only ones to be considered
are those with finite average sample numbers. Under this assumption,” it can be shown that $T_N = T_{SN} = 1$
so that the test is certain to terminate (in the sense of probability). On the other hand, if it is
known that $T_N = T_{SN} = 1$ it does not always follow that the average sample numbers are finite. Such a
situation would mean only that if a sequence of runs of the test were made, each run would probably
terminate, but the average stage of termination would become arbitrarily large as more runs were made.
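
The distinction between certain termination and a finite average sample number can be illustrated with two hypothetical sets of stage termination probabilities. Neither comes from the paper: the geometric set has a finite average, while the set 1/(n(n+1)) sums to one yet has partial averages that keep growing as more stages are included.

```python
def average_sample_number(termination_probs):
    """Sum of n times the probability of terminating at stage n (stages start at 1)."""
    return sum(n * p for n, p in enumerate(termination_probs, start=1))

def geometric_termination(p, n_stages):
    """Hypothetical test that stops at each stage with the same probability p."""
    return [p * (1.0 - p) ** (n - 1) for n in range(1, n_stages + 1)]

def heavy_tail_termination(n_stages):
    """Hypothetical test with stage termination probability 1 / (n (n + 1))."""
    return [1.0 / (n * (n + 1)) for n in range(1, n_stages + 1)]

geometric = geometric_termination(0.25, 400)
geometric_total = sum(geometric)                       # close to 1
geometric_average = average_sample_number(geometric)   # close to 1 / 0.25

heavy_total = sum(heavy_tail_termination(1_000))       # also close to 1
heavy_average_short = average_sample_number(heavy_tail_termination(1_000))
heavy_average_long = average_sample_number(heavy_tail_termination(100_000))
print(geometric_total, geometric_average)
print(heavy_total, heavy_average_short, heavy_average_long)
```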


3.5 Sequential Ratio Tests 


In studying non-sequential tests using finite samples it was found that the best criterion could
always be expressed in terms of likelihood ratio. Therefore, it may be useful to introduce likelihood
ratios at each stage of an infinite sample plan. The nth stage likelihood ratio function $\ell(X_n)$ is
defined as the ratio $f_{SN}(X_n)/f_N(X_n)$. Optimum criteria in the finite sample tests turned out to be
criteria listing all samples X for which $\ell(X)$ is greater than or equal to a certain number. It should
be possible to choose sequential criteria $(A_n, B_n, C_n)$ in the same way. For each stage two numbers
$a_n$ and $b_n$ with $b_n \le a_n$ could be chosen. Then the criteria $(A_n, B_n, C_n)$ determined by the numbers $a_n$
and $b_n$ would be

$A_n$ lists all samples $X_n$ of the sample space $S_n$ for which $\ell(X_n) \ge a_n$,
$B_n$ lists all samples $X_n$ of the sample space $S_n$ for which $\ell(X_n) \le b_n$,
$C_n$ lists all samples $X_n$ of the sample space $S_n$ for which $b_n < \ell(X_n) < a_n$.

If criteria selected in this way meet the requirements that the average sample numbers be finite, then
the resulting sequential test is called a "sequential ratio test."
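
A sequential ratio test with constant levels can be simulated in a few lines. The sketch below rests on assumptions that the text does not make: the measurements are taken to be independent Gaussian variables (mean 0 under noise alone, mean d under signal plus noise), so that the likelihood ratio of the nth stage sample is a product of single-measurement ratios. The levels a and b, the names, and the number of runs are hypothetical.

```python
import math
import random

def sequential_ratio_test(draw, log_ratio_increment, a, b, max_stages=10_000):
    """Run one test; return (decision, stage), decision being 'SN', 'N' or None."""
    log_ratio = 0.0
    for stage in range(1, max_stages + 1):
        log_ratio += log_ratio_increment(draw())   # log likelihood ratio of X_n
        if log_ratio >= math.log(a):               # X_n is listed in A_n
            return "SN", stage
        if log_ratio <= math.log(b):               # X_n is listed in B_n
            return "N", stage
        # Otherwise X_n is listed in C_n and another measurement is made.
    return None, max_stages

def simulate(signal_present, d=1.0, a=19.0, b=1.0 / 19.0, n_runs=2000, seed=1):
    rng = random.Random(seed)
    mean = d if signal_present else 0.0
    draw = lambda: rng.gauss(mean, 1.0)
    increment = lambda x: d * x - d * d / 2.0      # log of f_SN(x) / f_N(x)
    results = [sequential_ratio_test(draw, increment, a, b) for _ in range(n_runs)]
    says_signal = sum(1 for decision, _ in results if decision == "SN") / n_runs
    mean_stage = sum(stage for _, stage in results) / n_runs
    return says_signal, mean_stage

false_alarm_rate, avg_stage_noise = simulate(signal_present=False)
hit_rate, avg_stage_signal = simulate(signal_present=True)
print(f"noise alone: false alarm rate {false_alarm_rate:.3f}, mean stage {avg_stage_noise:.2f}")
print(f"signal plus noise: hit rate {hit_rate:.3f}, mean stage {avg_stage_signal:.2f}")
```

The mean stage printed for each population is an empirical estimate of the corresponding average sample number under these assumptions.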


3.6 Optimum Sequential Tests 


* Remember that the sampling process is not assumed to yield independence among the $x_i$.




It is customary^8 to define an optimum sequential test as that one for which the average sample
numbers $E_N$ and $E_{SN}$ are minimum among all sequential tests with fixed error probabilities F and M.

In addition to the formulas given in Section 3.4, alternative formulas” for the average sample
numbers are

$$E_N = 1 + \sum_{n=1}^{\infty} P_N(C_n) \tag{38}$$

and

$$E_{SN} = 1 + \sum_{n=1}^{\infty} P_{SN}(C_n). \tag{39}$$
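
The agreement between the two forms of the average sample number can be checked on a small example. The sketch assumes a test that is certain to terminate, so that the probability of still being in the continuation criterion at stage n is one minus the probability of having terminated at or before stage n. The geometric termination probabilities and the names are hypothetical.

```python
def continuation_probs(termination_probs):
    """Probability of continuing past stage n, for a test certain to terminate."""
    remaining = 1.0
    out = []
    for p in termination_probs:
        remaining -= p          # remove the probability of stopping at this stage
        out.append(remaining)
    return out

# Hypothetical test that stops at each stage with probability 0.2.
termination = [0.2 * 0.8 ** (n - 1) for n in range(1, 501)]

direct_form = sum(n * p for n, p in enumerate(termination, start=1))
alternative_form = 1.0 + sum(continuation_probs(termination))
print(direct_form, alternative_form)
```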


Thus, if a set of sequential criteria $\{(A_n, B_n, C_n)\}$ is presented as a possible optimum test, then its
optimum character is decided by ascertaining whether the inequalities

$$\sum_n P_N(C_n) \le \sum_n P_N(C_n^*) \tag{40}$$

and

$$\sum_n P_{SN}(C_n) \le \sum_n P_{SN}(C_n^*) \tag{41}$$

hold for every other set of sequential criteria $\{(A_n^*, B_n^*, C_n^*)\}$ with the same error probabilities, i.e.,
with

$$\sum_n P_N(A_n) = \sum_n P_N(A_n^*) \tag{42}$$

and

$$\sum_n P_{SN}(B_n) = \sum_n P_{SN}(B_n^*). \tag{43}$$

The problem of constructing an optimum sequential test is difficult because the equalities (42)
and (43) can be satisfied even when there is no apparent term-by-term relation between the sequences
$\{P_N(C_n)\}$ and $\{P_N(C_n^*)\}$. Wald has proposed as optimum the tests in which each of the sequences
$\{a_n\}$ and $\{b_n\}$ is constant, that is, $b_n = b$ and $a_n = a$ for all n. Moreover Wald and Wolfowitz^10
proved that these tests are optimum whenever the density functions at successive stages are independent,
as can be the case for example when both noise and signal plus noise consist of "random noise."
However, this "randomness" is not met with in most applications of the theory of signal detectability,
at least not in the sense that the hypotheses of Wald and Wolfowitz are satisfied.

Consider a test of fixed length as described in Section 2, with error probabilities F and M. Al- 
though the optimum sequential test with these same error probabilities generally requires less time 
on the average, it has the disadvantage that it will sometimes use much more time than the fixed length 
test requires. In a conversation with the authors, Professor Mark Kac of Cornell University suggested 
that the dispersion, or variance, of the sample numbers may be so large as seriously to affect the 
usefulness of the sequential tests in applications to signal detectability. Certainly this matter 
should be investigated before a final decision is reached concerning the merits of sequential tests 
relative to tests on a fixed observation interval. However it is a difficult matter to calculate the 
variance of the sample numbers. Therefore an electronic simulator is being built at the University of
Michigan which will simulate both types of tests and will provide data for ROC curves of both types as
well as the distribution of the (sequential) sample numbers.


4. OPTIMUM DETECTION FOR SPECIFIC CASES


4.1 Introduction


The chief conclusion obtained from the general theory of signal detectability presented in Section 
2 of this paper is that a receiver which calculates the likelihood ratio for each receiver input is 
the optimum receiver for detecting signals in noise. 

It is the purpose of Section 4 to consider a number of different ensembles of signals with
bandlimited white Gaussian noise. For each case, a possible receiver design is discussed. The primary
emphasis, however, is on obtaining the probability of detection and probability of false alarm, and 
hence on estimates of optimum receiver performance for the various cases. 

The cases which are presented were chosen from the simplest problems in signal detection which 
closely represent practical situations. They are listed in Table I along with examples of engineering 
problems in which they find application. In the last two cases the uncertainty in the signal can be 
varied, and some light is thrown on the relationship between uncertainty and the ability to detect 
signals. The variety of examples presented should serve to suggest methods for attacking other simple 
signal detection problems and to give insight into problems too complicated to allow a direct solution. 

The reader will find the discussion of likelihood ratio and its distribution easier to follow if 
he keeps in mind the connection between a criterion type receiver and likelihood ratio. In an optimum 
criterion type system, the operator will say that a signal is present whenever the likelihood ratio 
is above a certain level $\beta$. He will say that only noise is present when the likelihood ratio is below 
$\beta$. For each operating level $\beta$, there is a false alarm probability and a probability of detection. 
The false alarm probability is the probability that the likelihood ratio $\ell(x)$ will be greater than $\beta$ 
if no signal is sent; this is by definition the complementary distribution function $F_N(\beta)$. Likewise, 
the complementary distribution $F_{SN}(\beta)$ is the probability that $\ell(x)$ will be greater than $\beta$ if there is 
signal plus noise, and hence $F_{SN}(\beta)$ is the probability of detection if a signal is sent. 


TABLE I 

Section   Description of Signal Ensemble          Application 

4.4       Signal Known Exactly*                   Coherent radar with a target of known range and 
                                                  character. 

4.5       Signal Known Except for Phase*          Ordinary pulse radar with no integration and with a 
                                                  target of known range and character. 

4.6       Signal a Sample of White Gaussian       Detection of noise-like signals; detection of speech 
          Noise                                   sounds in Gaussian noise. 

4.7       Detector Output of a Broad Band         Detecting a pulse of known starting time (such as a 
          Receiver                                pulse from a radar beacon) with a crystal-video or 
                                                  other type broad band receiver. 

4.8       A Radar Case (A train of pulses         Ordinary pulse radar with integration and with a 
          with incoherent phase)                  target of known range and character. 

          Signal One of M Orthogonal Signals      Coherent radar where the target is at one of a finite 
                                                  number of non-overlapping positions. 

          Signal One of M Orthogonal Signals      Ordinary pulse radar with no integration and with a 
          Known Except for Phase                  target which may appear at one of a finite number of 
                                                  non-overlapping positions. 


4.2 Gaussian Noise 


In the remainder of this paper the receiver inputs will be assumed to be defined on a finite 
observation interval, 0 < t < T. It will further be assumed that the receiver inputs are series-bandlimited. 
By sampling plan C (Section 1.2) any such receiver input x(t) can be reconstructed from sample 
values of the function taken at points 1/2W apart throughout the observation interval, i.e., 


* Our treatment of these two fundamental cases is based upon Woodward and Davies' work,2 but here they 
are treated in terms of likelihood ratio, and hence apply to criterion type receivers as well as to 
a posteriori probability type receivers. These first two cases have been solved for the more general 
problem in which the noise is Gaussian but has an arbitrary spectrum. Those solutions require 
the use of an infinite sampling plan and are considerably more involved than the corresponding 
derivations in this report. 


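$$ x(t) = \sum_{k=1}^{2WT} x_k\, \psi_k(t), \tag{44} $$

where 

$$ \psi_k(t) = \frac{\sin \pi(2Wt - k)}{2WT \,\sin \pi\!\left(\dfrac{t}{T} - \dfrac{k}{2WT}\right)} \quad\text{and}\quad x_k = x\!\left(\frac{k}{2W}\right). \tag{45} $$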


Therefore the receiver inputs can be represented by the sample $(x_1, x_2, \ldots, x_{2WT})$. In Section 4 
the notation x will be used to denote either the receiver input function x(t) or the sample 
$(x_1, x_2, \ldots, x_{2WT})$. Similarly the signal s(t), or simply s, can be represented by the sample 
$(s_1, s_2, \ldots, s_{2WT})$, where $s_k = s(k/2W)$. 

Only the probability distributions for receiver inputs x(t) can be specified. The distribution 
must be given for the receiver inputs both with noise alone and with signal plus noise. The probability 
distributions are described by giving the probability density functions $f_{SN}(x)$ and $f_N(x)$ for the 
receiver inputs x. 

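The probability density function for the receiver inputs with noise alone is assumed to be 

$$ f_N(x) = \prod_{i=1}^{n} \left(\frac{1}{2\pi N}\right)^{1/2} \exp\!\left[-\frac{x_i^2}{2N}\right], $$

or 

$$ f_N(x) = \left(\frac{1}{2\pi N}\right)^{n/2} \exp\!\left[-\frac{1}{2N}\sum_{i=1}^{n} x_i^2\right], \tag{46} $$

where n = 2WT and N is the noise power. It can be verified easily that this probability density 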
function is the description of noise which has a Gaussian distribution of amplitude at every time, 
is stationary, and has the same average power in each of its Fourier components. Thus we shall refer 
to it as "stationary band-limited white Gaussian noise." 

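The functions $\psi_k(t)$ are orthogonal and have energy 1/2W, and therefore 

$$ \frac{1}{2N}\sum_{i=1}^{n} x_i^2 = \frac{W}{N}\int_0^T [x(t)]^2\, dt, \tag{47} $$

so that 

$$ f_N(x) = \left(\frac{1}{2\pi N}\right)^{n/2} \exp\!\left[-\frac{1}{N_0}\int_0^T [x(t)]^2\, dt\right], \tag{48} $$

where $N_0 = N/W$ is the noise power per unit bandwidth. 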

In a practical application, information is given about the signals as they would appear without 
noise at the receiver input, rather than about the signal plus noise probability density. Then 
$f_{SN}(x)$ must be calculated from this information and the probability density function $f_N(x)$ for the 
noise. The noise and the signals will be assumed independent of each other. 

If the input to the receiver is the sum of the signal and the noise, then the receiver input x(t) 
could have been caused by any signal s(t) and noise n(t) = x(t) - s(t). The probability density for 
the input x in signal plus noise is thus the probability (density) that s(t) and x(t) - s(t) will occur 
together, averaged over all possible s(t). If the probability of the signals is described by a density 
function $f_S(s)$, then 

$$ f_{SN}(x) = \int f_N(x-s)\, f_S(s)\, ds, \tag{49} $$


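where the integration is over the entire range of the sample variables $s_k$. A more general form is used 
when the probability of the signals is described by a probability measure $P_S$; the formula in this 
case is 

$$ f_{SN}(x) = \int f_N(x-s)\, dP_S(s). \tag{50} $$

This integral is a Lebesgue integral, and is essentially an "average" of $f_N(x-s)$ over all values of s 
weighted by the probability $P_S$. If $f_N(x)$ is taken from Eq. (46), this becomes 

$$ f_{SN}(x) = \left(\frac{1}{2\pi N}\right)^{n/2} \exp\!\left[-\frac{1}{2N}\sum_{i=1}^{n} x_i^2\right] \int \exp\!\left[-\frac{1}{2N}\sum_{i=1}^{n} s_i^2\right] \exp\!\left[\frac{1}{N}\sum_{i=1}^{n} x_i s_i\right] dP_S(s). \tag{51} $$

In terms of the time functions, using Eq. (48), 

$$ f_{SN}(x) = \int f_N(x-s)\, dP_S(s) = \left(\frac{1}{2\pi N}\right)^{n/2} \int \exp\!\left[-\frac{1}{N_0}\int_0^T [x(t)-s(t)]^2\, dt\right] dP_S(s) $$

$$ \qquad = \left(\frac{1}{2\pi N}\right)^{n/2} \exp\!\left[-\frac{1}{N_0}\int_0^T [x(t)]^2\, dt\right] \int \exp\!\left[-\frac{1}{N_0}\int_0^T [s(t)]^2\, dt\right] \exp\!\left[\frac{2}{N_0}\int_0^T x(t)\,s(t)\, dt\right] dP_S(s). \tag{52} $$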


The factor $\exp\!\left[-(1/N_0)\int_0^T [x(t)]^2\,dt\right] = \exp\!\left[-(1/2N)\sum x_i^2\right]$ can be brought out of the integral since it 
does not depend on s, the variable of integration. Note that the integral 

$$ \int_0^T [s(t)]^2\, dt = \frac{1}{2W}\sum_{i=1}^{n} s_i^2 = E(s) \tag{53} $$

is the energy of the expected signal,* while 

$$ \int_0^T x(t)\, s(t)\, dt = \frac{1}{2W}\sum_{i=1}^{n} x_i s_i \tag{54} $$

is the cross correlation between the expected signal and the receiver input. 


4.3 Likelihood Ratio with Gaussian Noise 


Likelihood ratio is defined as the ratio of the probability density functions $f_{SN}(x)$ and $f_N(x)$. 
With white Gaussian noise it is obtained by dividing Eqs. (51) and (52) by (46) and (48) respectively: 

$$ \ell(x) = \int \exp\!\left[-\frac{E(s)}{N_0}\right] \exp\!\left[\frac{1}{N}\sum_{i=1}^{n} x_i s_i\right] dP_S(s), \tag{55} $$

$$ \ell(x) = \int \exp\!\left[-\frac{E(s)}{N_0}\right] \exp\!\left[\frac{2}{N_0}\int_0^T x(t)\, s(t)\, dt\right] dP_S(s). \tag{56} $$

* This assumes that the circuit impedance is normalized to one ohm. 


If the signal is known exactly or completely specified, the probability for that signal is unity, 
and the probability for any set of possible signals not containing s is zero. Then the likelihood 
ratio becomes 


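$$ \ell_s(x) = \exp\!\left[-\frac{E(s)}{N_0}\right] \exp\!\left[\frac{1}{N}\sum_{i=1}^{n} x_i s_i\right], \quad\text{or} \tag{57} $$

$$ \ell_s(x) = \exp\!\left[-\frac{E(s)}{N_0}\right] \exp\!\left[\frac{2}{N_0}\int_0^T x(t)\, s(t)\, dt\right]. \tag{58} $$

Thus the general forms (55) and (56) for likelihood ratio state that $\ell(x)$ is the weighted average 
of $\ell_s(x)$ over the set of all signals, i.e., 

$$ \ell(x) = \int \ell_s(x)\, dP_S(s). \tag{59} $$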


An equipment which calculates the likelihood ratio $\ell(x)$ for each receiver input x is the optimum 
receiver. The form of equation (58) suggests one form which this equipment might take. First, for 
each possible expected signal s, the individual likelihood ratio $\ell_s(x)$ is calculated. Then these 
numbers are averaged. Since the set of expected signals is often infinite, this direct method is usually 
impractical. It is frequently possible in particular cases to obtain by mathematical operations on 
equation (58) a different form for $\ell(x)$ which can be recognized as the response of a realizable 
electronic equipment, simpler than the equipment specified by the direct method. It is essentially this 
which is done in the following paragraphs. 

If the distribution function $P_S(s)$ depends on various parameters such as carrier phase, signal 
energy, or carrier frequency, and if the distributions in these parameters are independent, the 
expression for likelihood ratio can be simplified somewhat. If these parameters are indicated by 
$r_1, r_2, \ldots, r_n$, and the associated probability density functions are denoted by 
$f_1(r_1), f_2(r_2), \ldots, f_n(r_n)$, then 

$$ dP_S(s) = f_1(r_1) \cdots f_n(r_n)\, dr_1 \cdots dr_n . $$

The likelihood ratio becomes 

$$ \ell(x) = \int \cdots \int \ell_s(x)\, f_1(r_1) \cdots f_n(r_n)\, dr_1 \cdots dr_n = \int \left[ \cdots \left[ \int \ell_s(x)\, f_1(r_1)\, dr_1 \right] \cdots \right] f_n(r_n)\, dr_n . \tag{60} $$

Thus the likelihood ratio can be found by averaging $\ell_s(x)$ with respect to the parameters. 


4.4 The Case of a Signal Known Exactly 


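The likelihood ratio for the case when the signal is known exactly has already been presented in 
Section 4.3: 

$$ \ell(x) = \exp\!\left[-\frac{E}{N_0}\right] \exp\!\left[\frac{1}{N}\sum_{i=1}^{n} x_i s_i\right], \tag{61} $$

$$ \ell(x) = \exp\!\left[-\frac{E}{N_0}\right] \exp\!\left[\frac{2}{N_0}\int_0^T x(t)\, s(t)\, dt\right]. \tag{62} $$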


As the first step in finding the distribution functions for $\ell(x)$, it is convenient to find the 
distribution for $(1/N)\sum x_i s_i$ when there is noise alone. Then the input $x = (x_1, x_2, \ldots, x_n)$ is due 
to white Gaussian noise. It can be seen from Eq. (46) that each $x_i$ has a normal distribution with 
zero mean and variance $N = WN_0$, and that the $x_i$ are independent. Because the $s_i$ are constants depending 
on the signal to be detected, $s = (s_1, s_2, \ldots, s_n)$, each summand $(x_i s_i)/N$ has a normal distribution 
with mean $s_i/N$ times the mean of $x_i$, and with variance $(s_i/N)^2$ times the variance of $x_i$, which are zero 
and $s_i^2/N$ respectively. Because the $x_i$ are independent, the summands $(s_i x_i)/N$ are independent, each 
with normal distribution, and therefore their sum has a normal distribution with mean the sum of the 
means -- i.e., zero -- and variance the sum of the variances: 

$$ \sum_{i=1}^{n} \frac{s_i^2}{N} = \frac{2W E(s)}{N} = \frac{2E}{N_0} = 2 \times \frac{\text{Signal Energy}}{\text{Noise Power Per Unit Bandwidth}} . \tag{63} $$
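The variance stated in Eq. (63) can be checked numerically. The following is a minimal Monte Carlo sketch, not part of the original paper; the values of W, T, N and the sinusoidal test signal are hypothetical choices made only for illustration.

```python
import math
import random


def correlation_statistic(x, s, noise_power):
    """Return (1/N) * sum_i x_i * s_i for sample vectors x and s."""
    return sum(xi * si for xi, si in zip(x, s)) / noise_power


def simulate_noise_only(s, noise_power, trials, seed=0):
    """Estimate mean and variance of the statistic when x is noise alone."""
    rng = random.Random(seed)
    sigma = math.sqrt(noise_power)  # each sample x_i has variance N
    values = []
    for _ in range(trials):
        x = [rng.gauss(0.0, sigma) for _ in s]
        values.append(correlation_statistic(x, s, noise_power))
    mean = sum(values) / trials
    var = sum((v - mean) ** 2 for v in values) / (trials - 1)
    return mean, var


# Hypothetical parameters (assumptions, not from the paper).
W, T, N = 8.0, 1.0, 2.0
n = int(2 * W * T)                       # number of sample points, 2WT
s = [math.sin(2 * math.pi * 3 * k / n) for k in range(1, n + 1)]
E = sum(si * si for si in s) / (2 * W)   # signal energy, Eq. (53)
N0 = N / W                               # noise power per unit bandwidth
predicted_variance = 2 * E / N0          # Eq. (63)

sample_mean, sample_variance = simulate_noise_only(s, N, trials=20000)
```

With these assumed values the predicted variance is 4, and the simulated variance should agree with it to within sampling error.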


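The distribution for $(1/N)\sum x_i s_i$ with noise alone is thus normal with zero mean and variance $2E/N_0$. 
Recalling from Eq. (61) 

$$ \ell(x) = \exp\!\left[-\frac{E}{N_0} + \frac{1}{N}\sum_{i=1}^{n} x_i s_i\right], \tag{64} $$

one sees that the distribution for $(1/N)\sum x_i s_i$ can be used directly by introducing $\alpha$ defined by 

$$ \beta = \exp\!\left[-\frac{E}{N_0} + \alpha\right], \quad\text{or}\quad \alpha = \frac{E}{N_0} + \ln \beta . \tag{65} $$

The inequality $\ell(x) \ge \beta$ is equivalent to $(1/N)\sum x_i s_i \ge \alpha$, and therefore 

$$ F_N(\beta) = \sqrt{\frac{N_0}{4\pi E}} \int_{\alpha}^{\infty} \exp\!\left[-\frac{N_0}{4E}\, y^2\right] dy . \tag{66} $$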


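The distribution for the case of signal plus noise can be found by using Eq. (19), which states 
that 

$$ \left.\frac{dP_{SN}(A(\beta))}{dP_N(A(\beta))}\right|_{\beta=\beta_0} = \beta_0 . \tag{67} $$

Because these probabilities are equal to the complementary distribution functions for likelihood ratio, 
this can be written as 

$$ dF_{SN}(\beta) = \beta\, dF_N(\beta). \tag{68} $$

Differentiating Eq. (66), 

$$ dF_N(\beta) = -\sqrt{\frac{N_0}{4\pi E}}\, \exp\!\left[-\frac{N_0}{4E}\,\alpha^2\right] d\alpha, \tag{69} $$

and combining (65), (68), and (69), one obtains 

$$ dF_{SN}(\beta) = -\sqrt{\frac{N_0}{4\pi E}}\, \exp\!\left[-\frac{N_0}{4E}\left(\alpha - \frac{2E}{N_0}\right)^2\right] d\alpha . \tag{70} $$

Thus 

$$ F_{SN}(\beta) = \sqrt{\frac{N_0}{4\pi E}} \int_{\alpha}^{\infty} \exp\!\left[-\frac{N_0}{4E}\left(y - \frac{2E}{N_0}\right)^2\right] dy . \tag{71} $$

In summary, $\alpha$, and therefore $\ln \beta$, have normal distributions with signal plus noise as well as with 
noise alone; the variance of each distribution is $2E/N_0$, and the difference of the means is $2E/N_0$. 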
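The summary above translates directly into a receiver operating characteristic. The sketch below is illustrative only: it assumes the relation $\alpha = E/N_0 + \ln\beta$ that follows from taking the logarithm of Eq. (61), writes $d = 2E/N_0$, and uses hypothetical function names. The normal tail probabilities come from the standard library's `math.erfc`.

```python
import math


def normal_tail(z):
    """Probability that a unit normal deviate exceeds z."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))


def roc_point_known_signal(beta, d):
    """Return (F_N, F_SN) at operating level beta for d = 2E/N0.

    Assumes alpha = E/N0 + ln(beta); the statistic (1/N) sum x_i s_i is
    normal with variance d, mean 0 in noise and mean d in signal plus noise.
    """
    alpha = d / 2.0 + math.log(beta)
    sigma = math.sqrt(d)
    false_alarm = normal_tail(alpha / sigma)
    detection = normal_tail((alpha - d) / sigma)
    return false_alarm, detection


# Trace one curve by sweeping the operating level (hypothetical values).
curve = [roc_point_known_signal(b, 4.0) for b in (0.1, 0.5, 1.0, 2.0, 10.0)]
```

Sweeping $\beta$ from small to large values moves the operating point from the upper right to the lower left of the curve for the chosen d.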


FIG. 2 

RECEIVER OPERATING CHARACTERISTIC 

$\ln \ell$ IS A NORMAL DEVIATE WITH $\sigma_{SN}^2 = \sigma_N^2$, $(M_{SN} - M_N)^2 = d\,\sigma_N^2$ 

[Figure not reproduced; the vertical axis is $F_{SN}(\beta)$.] 


The receiver operating characteristic curves in Figs. 2 and 3* are plotted for any case in which 
$\ln \ell$ has a normal distribution with the same variance both with noise alone and with signal plus noise. 
The parameter d in this figure is equal to the square of the difference of the means, divided by the 
variance. These receiver operating characteristic curves apply to the case of the signal known exactly, 
with $d = 2E/N_0$. 

FIG. 3 

RECEIVER OPERATING CHARACTERISTIC. 

$\ln \ell$ IS A NORMAL DEVIATE, $\sigma_{SN}^2 = \sigma_N^2$, $(M_{SN} - M_N)^2 = d\,\sigma_N^2$ 

[Figure not reproduced; both axes are probability scales.] 

* In Fig. 3, the receiver operating characteristic curves are plotted on "double probability" paper. 
On this paper both axes are linear in the error function $\mathrm{erf}(x) = (1/\sqrt{2\pi}) \int_{-\infty}^{x} \exp[-t^2/2]\, dt$; this 
makes the receiver operating characteristics straight lines. 

Eq. (62) describes what the ideal receiver should do for this case. The essential operation in the 
receiver is obtaining the correlation, $\int_0^T s(t)\,x(t)\,dt$. The other operations, multiplying by a constant, 
adding a constant, and taking the exponential function, can be taken care of simply in the calibration 
of the receiver output. Electronic means of obtaining cross correlation have been developed recently.1 

If the form of the signal is simple, there is a simple way to obtain this cross correlation,®7 
Suppose h(t) is the impulse response of a filter. The response $e_0(t)$ of the filter to a voltage x(t) is 

$$ e_0(t) = \int_{-\infty}^{t} x(\tau)\, h(t-\tau)\, d\tau . \tag{72} $$

If a filter can be synthesized so that 

$$ h(t) = s(T-t), \quad 0 \le t \le T; \qquad h(t) = 0 \ \text{otherwise}, \tag{73} $$

then 

$$ e_0(T) = \int_0^T x(\tau)\, s(\tau)\, d\tau , \tag{74} $$

so that the response of this filter at time T is the cross correlation required. Thus, the ideal 
receiver consists simply of a filter and amplifiers. 

It should be noted that this filter is the same, except for a constant factor, as that specified 
when one asks for the filter which maximizes peak signal to average noise power ratio. 
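The filter argument can be illustrated in discrete time. The sketch below is an illustration under stated assumptions, not the authors' implementation: the integrals are replaced by sums with a hypothetical sample spacing `dt`, and the function names are invented for the example. It shows that a filter whose impulse response is the time-reversed signal produces the cross correlation at the final sample.

```python
def filter_output(x, h, dt):
    """Causal convolution e0[m] = dt * sum_j x[j] * h[m - j]."""
    out = []
    for m in range(len(x)):
        acc = 0.0
        for j in range(m + 1):
            if m - j < len(h):
                acc += x[j] * h[m - j]
        out.append(acc * dt)
    return out


def matched_impulse_response(s):
    """Discrete analogue of h(t) = s(T - t): the signal reversed in time."""
    return list(reversed(s))


def cross_correlation(x, s, dt):
    """Discrete analogue of the integral of x(t) s(t) over the interval."""
    return dt * sum(xi * si for xi, si in zip(x, s))


# Hypothetical sample values for the expected signal and a receiver input.
s = [1.0, 2.0, -1.0, 0.5]
x = [0.2, -0.4, 1.0, 3.0]
dt = 0.5
response = filter_output(x, matched_impulse_response(s), dt)
```

The last element of `response` corresponds to the filter output at time T and equals `cross_correlation(x, s, dt)`.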


4.5 Signal Known Except for Carrier Phase 


The signal ensemble considered in this section consists of all signals which differ from a given 
amplitude and frequency modulated signal only in their carrier phase, and all carrier phases are 
assumed equally likely: 

$$ s(t) = f(t) \cos(\omega t + \phi(t) - \theta). \tag{75} $$

Since the unknown phase angle $\theta$ has a uniform distribution, 

$$ dP_S(\theta) = \frac{1}{2\pi}\, d\theta . \tag{76} $$

The likelihood ratio can be found by applying Eq. (56), and since the signal energy E(s) is the same 
for all values of the carrier phase $\theta$, 

$$ \ell(x) = \exp\!\left[-\frac{E}{N_0}\right] \int \exp\!\left[\frac{1}{N}\sum_i x_i s_i\right] dP_S(\theta). \tag{77} $$

Expanding s into the coefficients of $\cos\theta$ and $\sin\theta$ will be helpful: 

$$ s(t) = f(t)\cos(\omega t + \phi(t)) \cos\theta + f(t)\sin(\omega t + \phi(t)) \sin\theta , \tag{78} $$

and 

$$ \frac{1}{N}\sum_i x_i s_i = \frac{\cos\theta}{N} \sum_i x_i f(t_i) \cos(\omega t_i + \phi(t_i)) + \frac{\sin\theta}{N} \sum_i x_i f(t_i) \sin(\omega t_i + \phi(t_i)). \tag{79} $$

Because we wish to integrate with respect to $\theta$ to find the likelihood ratio, it is easiest to 
introduce parameters similar to polar coordinates $(r, \theta_0)$ such that 

$$ \frac{r}{N}\cos\theta_0 = \frac{1}{N}\sum_i x_i f(t_i) \cos(\omega t_i + \phi(t_i)), \qquad \frac{r}{N}\sin\theta_0 = \frac{1}{N}\sum_i x_i f(t_i) \sin(\omega t_i + \phi(t_i)), \tag{80} $$

and therefore 

$$ \frac{1}{N}\sum_i x_i s_i = \frac{r}{N}\cos(\theta - \theta_0). \tag{81} $$

Using this form the likelihood ratio becomes 

$$ \ell(x) = \exp\!\left[-\frac{E}{N_0}\right] \int_0^{2\pi} \exp\!\left[\frac{r}{N}\cos(\theta - \theta_0)\right] \frac{d\theta}{2\pi} = \exp\!\left[-\frac{E}{N_0}\right] I_0\!\left(\frac{r}{N}\right), \tag{82} $$

where $I_0$ is the Bessel function of zero order and pure imaginary argument. 
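The step from the phase average to the Bessel function in Eq. (82) can be verified numerically. The following sketch is illustrative and not from the original paper: it evaluates $I_0$ from its power series and compares it with a midpoint-rule average of $\exp[a\cos(\theta - \theta_0)]$ over one period. The function names and the test values of $a = r/N$ and $\theta_0$ are hypothetical.

```python
import math


def bessel_i0(z):
    """Modified Bessel function I0(z) from its power series."""
    quarter_sq = (z / 2.0) ** 2
    term, total, k = 1.0, 1.0, 0
    while term > 1e-17 * total:
        k += 1
        term *= quarter_sq / (k * k)   # ratio of successive series terms
        total += term
    return total


def phase_average(a, theta0, points=2000):
    """Average of exp(a*cos(theta - theta0)) over a uniform phase theta."""
    acc = 0.0
    for j in range(points):
        theta = 2.0 * math.pi * (j + 0.5) / points
        acc += math.exp(a * math.cos(theta - theta0))
    return acc / points


a, theta0 = 1.7, 0.4      # hypothetical values of r/N and theta_0
averaged = phase_average(a, theta0)
closed_form = bessel_i0(a)
```

The two values agree regardless of $\theta_0$, which is why the likelihood ratio depends on the input only through r/N.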

$I_0$ is a strictly monotone increasing function, and therefore the likelihood ratio will be greater 
than a value $\beta$ if and only if r/N is greater than some value corresponding to $\beta$. 

In the previous section it was shown that the sum $(1/N)\sum x_i s_i$ has a normal distribution with zero 
mean and variance $2E/N_0$ if the receiver input x(t) is due to noise alone; E is the energy of the signal 
known exactly, s(t), and $N_0$ is the noise power per cycle. Since $f(t)\cos(\omega t + \phi(t))$ and 
$f(t)\sin(\omega t + \phi(t))$ are signals known exactly, both $(r/N)\cos\theta_0$ and $(r/N)\sin\theta_0$ have normal distributions 
with zero mean and variance $2E/N_0$. The probability that due to noise alone 
$(r/N)^2 = ((r/N)\cos\theta_0)^2 + ((r/N)\sin\theta_0)^2$ will exceed any fixed value is given by the well known chi-square 
distribution for two degrees of freedom, $K_2(\alpha^2)$. The proper normalization yielding zero mean and unit 
variance requires that the variable be $(r/N)\sqrt{N_0/2E(s)}$; that is,** 

$$ P_N\!\left(\frac{r}{N}\sqrt{\frac{N_0}{2E}} \ge \alpha\right) = K_2(\alpha^2) = \exp\!\left[-\frac{\alpha^2}{2}\right]. \tag{83} $$

* $t_i$ denotes the i-th sampling time, i.e., $t_i = i/2W$. 

** The symbol $P(x \ge a)$ denotes the probability that the variable x is not less than the constant a. 


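If $\alpha$ is defined by the equation 

$$ \beta = \exp\!\left[-\frac{E}{N_0}\right] I_0\!\left(\alpha\sqrt{\frac{2E}{N_0}}\right), \tag{84} $$

the distribution for $\ell(x)$ in the presence of noise alone is in the simple form 

$$ F_N(\beta) = \exp\!\left[-\frac{\alpha^2}{2}\right]. \tag{85} $$

It follows from (85) that 

$$ dF_N(\beta) = -\alpha \exp\!\left[-\frac{\alpha^2}{2}\right] d\alpha . \tag{86} $$

If in equation (68), namely 

$$ \beta\, dF_N(\beta) = dF_{SN}(\beta), \tag{87} $$

$\beta$ is replaced by the expression given in (84) and $dF_N(\beta)$ is replaced by that given in (86), then 

$$ dF_{SN}(\beta) = -\exp\!\left[-\frac{E}{N_0}\right] \alpha \exp\!\left[-\frac{\alpha^2}{2}\right] I_0\!\left(\alpha\sqrt{\frac{2E}{N_0}}\right) d\alpha \tag{88} $$

is obtained. Integration of (88) yields 

$$ F_{SN}(\beta) = \exp\!\left[-\frac{E}{N_0}\right] \int_{\alpha}^{\infty} \alpha \exp\!\left[-\frac{\alpha^2}{2}\right] I_0\!\left(\alpha\sqrt{\frac{2E}{N_0}}\right) d\alpha . \tag{89} $$

Eqs. (85) and (89) yield the receiver operating characteristic in parametric form, and Eq. (84) gives 
the associated operating levels.15 These are graphed in Fig. 4 for some of the same values of signal 
energy to noise power per unit bandwidth as were used when the phase angle was known exactly, Figs. 
2 and 3, so that the effect of knowing the phase can be easily seen. 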
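A point on this parametric characteristic can be computed by straightforward numerical integration. The sketch below is illustrative only. It assumes the forms $F_N = \exp[-\alpha^2/2]$ of Eq. (85) and $F_{SN} = \exp[-E/N_0]\int_\alpha^\infty y\,\exp[-y^2/2]\,I_0(y\sqrt{2E/N_0})\,dy$ of Eq. (89), writes $d = 2E/N_0$, and uses a trapezoidal rule with a truncated upper limit; the function names, step count and cut-off are hypothetical choices.

```python
import math


def bessel_i0(z):
    """Modified Bessel function I0(z) from its power series."""
    quarter_sq = (z / 2.0) ** 2
    term, total, k = 1.0, 1.0, 0
    while term > 1e-17 * total:
        k += 1
        term *= quarter_sq / (k * k)
        total += term
    return total


def roc_point_unknown_phase(alpha, d, steps=4000):
    """Return (F_N, F_SN) for parameter alpha and d = 2E/N0."""
    false_alarm = math.exp(-alpha * alpha / 2.0)
    root_d = math.sqrt(d)
    upper = alpha + root_d + 10.0          # integrand is negligible beyond this
    h = (upper - alpha) / steps

    def integrand(y):
        # exp(-E/N0) = exp(-d/2) is folded into the exponent to avoid overflow.
        return y * math.exp(-(y * y + d) / 2.0) * bessel_i0(y * root_d)

    total = 0.5 * (integrand(alpha) + integrand(upper))
    for j in range(1, steps):
        total += integrand(alpha + j * h)
    return false_alarm, total * h


# A few points on the curve for a hypothetical d = 4.
points = [roc_point_unknown_phase(a, 4.0) for a in (0.5, 1.0, 2.0, 3.0)]
```

Setting d = 0 makes the two probabilities coincide, as they must when no signal energy is present.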

If the signal is sufficiently simple so that a filter could be synthesized to match the expected 
signal for a given carrier phase $\theta$ as in the case of a signal known exactly, then there is a simple 
way to design a receiver to obtain likelihood ratio. For simplicity let us consider only amplitude 
modulated signals ($\phi(t) = 0$ in Eq. (75)). Let us also choose $\theta = 0$. (Any phase could have been chosen.) 
Then the filter has impulse response 

$$ h(t) = f(T-t)\cos[\omega(T-t)], \quad 0 \le t \le T; \qquad h(t) = 0 \ \text{otherwise}. \tag{90} $$

The output of the filter in response to x(t) is then 

$$ e_0(t) = \int_{-\infty}^{t} x(\tau)\, h(t-\tau)\, d\tau = \int_{t-T}^{t} x(\tau)\, f(T+\tau-t) \cos\omega(T+\tau-t)\, d\tau $$

$$ \qquad = \cos\omega(T-t) \int_{t-T}^{t} x(\tau)\, f(T+\tau-t) \cos\omega\tau\, d\tau \;-\; \sin\omega(T-t) \int_{t-T}^{t} x(\tau)\, f(T+\tau-t) \sin\omega\tau\, d\tau . \tag{91} $$


Fig. 4 

RECEIVER OPERATING CHARACTERISTIC. 

SIGNAL KNOWN EXCEPT FOR PHASE. 

[Figure not reproduced; $F_{SN}(\beta)$ is plotted against $F_N(\beta)$, both on linear scales from 0 to 1.0.] 


The envelope of the filter output will be the square root of the sum of the squares of the 
integrals,* and the envelope at time T will be proportional to r/N, since 

$$ \left(\frac{r}{N}\right)^2 = \left(\frac{2}{N_0}\right)^2 \left\{ \left[\int_0^T x(t)\, f(t) \cos\omega t\, dt\right]^2 + \left[\int_0^T x(t)\, f(t) \sin\omega t\, dt\right]^2 \right\}, \tag{92} $$

where the sum of the squared integrals can be identified as the square of the envelope of $e_0(t)$ at 
time T. If the input x(t) passes through the filter with an impulse response given by Eq. (90), then 
through a linear detector, the output will be $(N_0/2)\,r/N$ at time T. Because the likelihood ratio, 
Eq. (82), is a known monotone function of r/N, the output can be calibrated to read the likelihood 
ratio of the input. 

* If the line spectrum of s(t) is zero at zero frequency and at all frequencies equal to or greater 
than $2\omega/2\pi$, then it can be shown that these integrals contain no frequencies as high as $\omega/2\pi$. 


4.6 Signal Consisting of a Sample of White Gaussian Noise 
Suppose the values of the signal voltage at the sample points are independent Gaussian random 
variables with zero mean and variance S, the signal power. The probability density due to signal plus 
noise is also Gaussian, since signal plus noise is the sum of two Gaussian random variables: 

$$ f_{SN}(x) = \left(\frac{1}{2\pi(N+S)}\right)^{n/2} \exp\!\left[-\frac{1}{2(N+S)}\sum_{i=1}^{n} x_i^2\right], \tag{93} $$

where n = 2WT. 
The likelihood ratio is 

$$ \ell(x) = \left(\frac{N}{N+S}\right)^{n/2} \exp\!\left[\frac{S}{2N(N+S)}\sum_{i=1}^{n} x_i^2\right]. \tag{94} $$

In determining the distribution functions for $\ell$, it is convenient to introduce the parameter $\alpha$, 
defined by the equation 

$$ \beta = \left(\frac{N}{N+S}\right)^{n/2} \exp\!\left[\frac{S}{2(N+S)}\,\alpha^2\right]. \tag{95} $$

Then the condition $\ell(x) \ge \beta$ is equivalent to the condition that $(1/N)\sum x_i^2 \ge \alpha^2$. In the presence of 
noise alone the random variables $x_i/\sqrt{N}$ have zero mean and unit variance, and they are independent. 
Therefore, the probability that the sum of the squares of these variables will exceed $\alpha^2$ is the 
chi-square distribution with n degrees of freedom, i.e., 

$$ F_N(\beta) = K_n(\alpha^2). \tag{96} $$

Similarly, in the presence of signal plus noise the random variables $x_i/\sqrt{N+S}$ have zero mean and unit 
variance. The condition $(1/N)\sum x_i^2 \ge \alpha^2$ is the same as requiring that 
$(1/(N+S))\sum x_i^2 \ge (N/(N+S))\,\alpha^2$, and again making use of the chi-square distribution, 

$$ F_{SN}(\beta) = K_n\!\left(\frac{N}{N+S}\,\alpha^2\right). \tag{97} $$
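The two chi-square expressions can be evaluated in closed form when the number of degrees of freedom is even. The sketch below is illustrative only; it assumes n is even (n = 2WT need not be), so that the tail probability reduces to a finite sum, and it takes $F_N = K_n(\alpha^2)$ and $F_{SN} = K_n(N\alpha^2/(N+S))$ from Eqs. (96) and (97). The function names and parameter values are hypothetical.

```python
import math


def chi_square_tail_even(n, x):
    """P(chi-square with n degrees of freedom > x), valid for even n only."""
    if n % 2 != 0:
        raise ValueError("this closed form assumes an even n")
    half = x / 2.0
    term, total = 1.0, 0.0
    for k in range(n // 2):
        total += term              # term equals half**k / k!
        term *= half / (k + 1)
    return math.exp(-half) * total


def roc_point_noise_like_signal(alpha_sq, n, snr):
    """Return (F_N, F_SN) for threshold alpha^2, n samples, and snr = S/N."""
    false_alarm = chi_square_tail_even(n, alpha_sq)
    detection = chi_square_tail_even(n, alpha_sq / (1.0 + snr))  # N/(N+S)
    return false_alarm, detection


# Hypothetical example: 20 sample points, signal power equal to noise power.
example = roc_point_noise_like_signal(alpha_sq=30.0, n=20, snr=1.0)
```

For n = 2 the tail reduces to $\exp[-x/2]$, which matches the two-degree-of-freedom case used in Section 4.5.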


For large values of n, the chi-square distribution is approximately normal over the center portion; 
more precisely,1© for a& >> On, 


oe) (98) 
iS )s L alee 
Fy(B) = K,(@%) Ss exp at | 


and 


(99) 


If the signal energy is small compared to that of the noise, $\sqrt{N/(N+S)}$ is nearly unity and both 
distributions have nearly the same variance. Then Figs. 2 and 3 apply to this case too, with the value of d 
given by 
$$ d = (2n-1)\left[1 - \sqrt{\frac{N}{N+S}}\,\right]^2 . \tag{100} $$


For these small signal to noise ratios and large samples, there is a simple relation between 
signal to noise ratio, the number of samples, and the detection index d: 

$$ 1 - \sqrt{\frac{N}{N+S}} \approx \frac{S}{2N} \quad\text{for } \frac{S}{N} \ll 1, \quad\text{and}\quad d \approx \frac{n}{2}\,\frac{S^2}{N^2} . \tag{101} $$

Two signal to noise ratios, $(S/N)_1$ and $(S/N)_2$, will give approximately the same operating 
characteristic if the corresponding numbers of sample points, $n_1$ and $n_2$, satisfy 

$$ \frac{n_1}{n_2} = \frac{(S/N)_2^2}{(S/N)_1^2} . \tag{102} $$


By Eq. (94), the likelihood ratio is a monotone function of $\sum x_i^2$. But the output of an energy detector, 

$$ e_0(t) = \int_0^T [x(t)]^2\, dt = \frac{1}{2W}\sum_{i=1}^{n} x_i^2 , \tag{103} $$

is proportional to $\sum x_i^2$. Therefore an energy detector can be calibrated to read likelihood ratio, and 
hence can be used as an optimum receiver in this case. 
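The calibration of an energy detector can be made concrete. The following sketch is illustrative, not from the original paper: it assumes the logarithm of the likelihood ratio has the form $(n/2)\ln[N/(N+S)] + S\sum x_i^2 / (2N(N+S))$, which follows from dividing two zero-mean Gaussian densities of variance N+S and N, and it checks that form against a direct computation. Function names and numerical values are hypothetical.

```python
import math


def log_gaussian_density(x, variance):
    """Log density of independent zero-mean Gaussian samples of given variance."""
    n = len(x)
    sum_sq = sum(v * v for v in x)
    return -0.5 * n * math.log(2.0 * math.pi * variance) - sum_sq / (2.0 * variance)


def log_likelihood_from_energy(sum_sq, n, noise_power, signal_power):
    """Calibration curve: ln(likelihood ratio) as a function of sum of x_i^2."""
    total = noise_power + signal_power
    return (0.5 * n * math.log(noise_power / total)
            + signal_power * sum_sq / (2.0 * noise_power * total))


def energy_detector_output(x, bandwidth):
    """Discrete form of the integral of x(t)^2: (1/2W) * sum of x_i^2."""
    return sum(v * v for v in x) / (2.0 * bandwidth)


# Hypothetical input samples and powers.
x = [0.3, -1.2, 0.7, 2.0]
N, S, W = 1.5, 0.5, 2.0
direct = log_gaussian_density(x, N + S) - log_gaussian_density(x, N)
from_energy = log_likelihood_from_energy(
    2.0 * W * energy_detector_output(x, W), len(x), N, S)
```

Because the calibration curve is strictly increasing in the detector output, thresholding the energy is equivalent to thresholding the likelihood ratio.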


4.7 Video Design of a Broad Band Receiver 


The problem considered in this section is represented schematically in Fig. 5. The signals 
and noise are assumed to have passed through a band pass filter, and at the output of the filter, 
point A on the diagram, they are assumed to be limited in spectrum to a band of width W and center 
frequency $\omega/2\pi > W/2$. The noise is assumed to be Gaussian noise with a uniform spectrum over the band. 
The signals and noise then pass through a linear detector. The output of the detector is the envelope 
of the signals and noise as they appeared at point A; all knowledge of the phase of the receiver input 
is lost at point B. The signals and noise as they appear at point B are considered receiver inputs, 
and the theory of signal detectability is applied to these video inputs to ascertain the best video 
design and the performance of such a system. The mathematical description of the signals and noise 
will be given for the signals and noise as they appear at point A. The envelope functions, which 
appear at point B, will be derived, and the likelihood ratio and its distribution will be found for 
these envelope functions. 

Fig. 5 

BLOCK DIAGRAM OF A BROAD BAND RECEIVER 

[Figure not reproduced; the blocks in order are: input from ... or mixer, band pass filter, point A, 
linear detector, point B, video amplifier.] 

The only case which will be considered here is the case in which the amplitude of the signal as 
it would appear at point A is a known function of time. 

Any function at point A will be band limited to a band of width W and center frequency $\omega/2\pi > W/2$. 
Any such function f(t) can be expanded as follows: 

$$ f(t) = x(t) \cos\omega t + y(t) \sin\omega t, \tag{105} $$

where* x(t) and y(t) are band limited to frequencies no higher than W/2, and hence can themselves be 
expanded by sampling plan C, yielding 

$$ f(t) = \sum_i \left[ x\!\left(\frac{i}{W}\right) \psi_i(t) \cos\omega t + y\!\left(\frac{i}{W}\right) \psi_i(t) \sin\omega t \right]. \tag{106} $$

The amplitude of the function f(t) is 

$$ r(t) = \sqrt{[x(t)]^2 + [y(t)]^2}, \tag{107} $$

and thus the amplitude at the i-th sampling point is 

$$ r_i = r\!\left(\frac{i}{W}\right) = \sqrt{x_i^2 + y_i^2} . \tag{108} $$

The angle 

$$ \theta_i = \arctan\frac{y_i}{x_i} = \arccos\frac{x_i}{r_i} \tag{109} $$

might be considered the phase of f(t) at the i-th sampling point. The function f(t) then might be 
described by giving the $r_i$ and $\theta_i$ rather than the $x_i$ and $y_i$. 

Let us denote by $x_i, y_i$, or $r_i, \theta_i$, the sample values for a receiver input after the filter (i.e., 
at the point A in Fig. 5). Let $a_i, b_i$, or $f_i, \phi_i$, denote the sample values for the signal as it would 
appear at point A if there were no noise. The envelope of the signal, hence the amplitude sample 
values $f_i$, are assumed known. Let us denote by $F_S(\phi_1, \phi_2, \ldots, \phi_{n/2})$ the distribution function of 
the phase sample values $\phi_i$. The probability density function for the input at A when there is white 
Gaussian noise and no signal, with n = 2WT, is 

$$ f_N(x, y) = \left(\frac{1}{2\pi N}\right)^{n/2} \exp\!\left[-\frac{1}{2N}\sum_{i=1}^{n/2} \left(x_i^2 + y_i^2\right)\right], \tag{110} $$

and for signal plus noise, it is 

$$ f_{SN}(x, y) = \left(\frac{1}{2\pi N}\right)^{n/2} \int \exp\!\left[-\frac{1}{2N}\sum_{i=1}^{n/2} \left\{(x_i - a_i)^2 + (y_i - b_i)^2\right\}\right] dP_S(a, b). \tag{111} $$

* Because any function f(t) at A has no frequency greater than $(\omega/2\pi) + (W/2)$, the usual sampling 
plan C might have been used on f(t). However, the distribution in noise alone, $f_N(x_i)$, would 
probably not be applicable. 


Expressed in terms of the $(r, \theta)$ sample values, Eq. (110) and Eq. (111) become 

$$ f_N(r, \theta) = \left(\frac{1}{2\pi N}\right)^{n/2} \prod_{i=1}^{n/2} r_i \, \exp\!\left[-\frac{1}{2N}\sum_{i=1}^{n/2} r_i^2\right] \tag{112} $$

and 

$$ f_{SN}(r, \theta) = \left(\frac{1}{2\pi N}\right)^{n/2} \prod_{i=1}^{n/2} r_i \int \exp\!\left[-\frac{1}{2N}\sum_{i=1}^{n/2} \left\{r_i^2 + f_i^2 - 2 r_i f_i \cos(\theta_i - \phi_i)\right\}\right] dF_S(\phi_1, \ldots, \phi_{n/2}). \tag{113} $$

The factors $\prod r_i$ are introduced because they are the Jacobian of the transformation from the x, y 
sampling plan to the r, $\theta$ sampling plan.16,* 

The probability density function for r alone, i.e., the density function for the output of the 
detector, is obtained simply by integrating the density functions for r and $\theta$ with respect to $\theta$: 


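$$ f_N(r) = \int_0^{2\pi} \cdots \int_0^{2\pi} f_N(r_i, \theta_i)\, d\theta_1\, d\theta_2 \cdots d\theta_{n/2}, \quad\text{or}\quad f_N(r) = \left(\frac{1}{N}\right)^{n/2} \prod_{i=1}^{n/2} r_i \, \exp\!\left[-\frac{1}{2N}\sum_{i=1}^{n/2} r_i^2\right], \tag{114} $$

and 

$$ f_{SN}(r) = \int_0^{2\pi} \cdots \int_0^{2\pi} f_{SN}(r_i, \theta_i)\, d\theta_1\, d\theta_2 \cdots d\theta_{n/2}, $$

or 

$$ f_{SN}(r) = \left(\frac{1}{N}\right)^{n/2} \prod_{i=1}^{n/2} r_i \, \exp\!\left[-\frac{1}{2N}\sum_{i=1}^{n/2} \left(r_i^2 + f_i^2\right)\right] \int \prod_{i=1}^{n/2} I_0\!\left(\frac{r_i f_i}{N}\right) dF_S(\phi_1, \ldots, \phi_{n/2}), $$

or 

$$ f_{SN}(r) = \left(\frac{1}{N}\right)^{n/2} \prod_{i=1}^{n/2} r_i \, I_0\!\left(\frac{r_i f_i}{N}\right) \exp\!\left[-\frac{1}{2N}\sum_{i=1}^{n/2} \left(r_i^2 + f_i^2\right)\right]. \tag{115} $$

Notice that the probability density for r is completely independent of the distribution which the 
$\phi_i$ had; all information about the phase of the signals has been lost. 
The likelihood ratio for a video input r(t) is 

$$ \ell(r) = \exp\!\left[-\frac{1}{2N}\sum_{i=1}^{n/2} f_i^2\right] \prod_{i=1}^{n/2} I_0\!\left(\frac{r_i f_i}{N}\right). \tag{116} $$

* For example, in two dimensions, $f_N(x, y)\, dx\, dy = f_N(r, \theta)\, r\, dr\, d\theta$. 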


Again it is more convenient to work with the logarithm of the likelihood ratio. Thus 

$$ \frac{1}{2N}\sum_{i=1}^{n/2} f_i^2 = \frac{1}{2N_0}\int_0^T [f(t)]^2\, dt = \frac{E}{N_0}, \quad\text{and} \tag{117} $$

$$ \ln \ell(r) = -\frac{E}{N_0} + \sum_{i=1}^{n/2} \ln I_0\!\left(\frac{r_i f_i}{N}\right), \tag{118} $$

which is approximately 

$$ \ln \ell(r(t)) = -\frac{E}{N_0} + W \int_0^T \ln I_0\!\left(\frac{r(t)\, f(t)}{N}\right) dt . \tag{119} $$

The function $\ln I_0(x)$ is approximately the parabola $x^2/4$ for small values of x and is nearly 
linear for large values of x. Thus, the expression for likelihood ratio might be approximated by 

$$ \ln \ell(r(t)) = -\frac{E}{N_0} + \frac{W}{4N^2} \int_0^T [r(t)]^2\, [f(t)]^2\, dt \tag{120} $$

for small signals, and by 

$$ \ln \ell(r(t)) = C_1 + C_2 \int_0^T r(t)\, f(t)\, dt \tag{121} $$

for large signals, where $C_1$ and $C_2$ are chosen to approximate $\ln I_0$ best in the desired range. 

The integrals in Eqs. (120) and (121) can be interpreted as cross correlations. Thus the optimum 
receiver for weak signals is a square law detector, followed by a correlator which finds the cross 
correlation between the detector output and $[f(t)]^2$, the square of the envelope of the expected signal. 
For the case of large signal to noise ratio, the optimum receiver is a linear detector, followed by a 
correlator which has for its output the cross correlation of the detector output and f(t), the 
amplitude of the expected signal. 

The distribution function for $\ell(r)$ cannot be found easily in this case. The approximation 
developed here will apply to the receiver designed for low signal to noise ratio, since this is the case 
of most interest in detection studies. An analogous approximation for the large signal to noise ratios 
would be even easier to derive. 
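The two regimes of $\ln I_0(x)$ that lead to the square law and linear detectors can be seen numerically. The sketch below is an illustration, not part of the original paper. The small-argument form $x^2/4$ is the one quoted in the text; the large-argument comparison uses the standard asymptotic form $x - \tfrac{1}{2}\ln(2\pi x)$, which is an addition made here to show the nearly linear behaviour. Function names are hypothetical.

```python
import math


def bessel_i0(z):
    """Modified Bessel function I0(z) from its power series."""
    quarter_sq = (z / 2.0) ** 2
    term, total, k = 1.0, 1.0, 0
    while term > 1e-17 * total:
        k += 1
        term *= quarter_sq / (k * k)
        total += term
    return total


def log_i0(x):
    """ln I0(x), the nonlinearity applied to each detector sample."""
    return math.log(bessel_i0(x))


def small_signal_form(x):
    """Parabolic approximation, leading to a square law detector."""
    return x * x / 4.0


def large_signal_form(x):
    """Asymptotic form for large x; its slope approaches one."""
    return x - 0.5 * math.log(2.0 * math.pi * x)


# Tabulate the exact value against both approximations.
table = [(x, log_i0(x), small_signal_form(x), large_signal_form(x))
         for x in (0.1, 0.5, 1.0, 5.0, 20.0)]
```

The parabola is accurate only for small arguments, and the straight-line behaviour takes over once the argument is several units large, which is the basis for choosing between the two receiver designs.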

First we shall find the mean and standard deviation for the distribution of the logarithm of 
the likelihood ratio as shown above, 

$$ \ln \ell(r) \approx -\frac{1}{2N}\sum_{i=1}^{n/2} f_i^2 + \frac{1}{4N^2}\sum_{i=1}^{n/2} r_i^2 f_i^2 , \tag{122} $$

for the case of small signal to noise ratio. The probability density functions for each $r_i$ are 

$$ g_{SN}(r_i) = \frac{r_i}{N} \exp\!\left[-\frac{r_i^2 + f_i^2}{2N}\right] I_0\!\left(\frac{r_i f_i}{N}\right) \quad\text{and}\quad g_N(r_i) = \frac{r_i}{N} \exp\!\left[-\frac{r_i^2}{2N}\right]. \tag{123} $$

The notation $g_N(r_i)$ and $g_{SN}(r_i)$ is used to distinguish these from the joint distributions of all the 
$r_i$, which were previously called $f_N(r)$ and $f_{SN}(r)$. The mean of each term $r_i^2 f_i^2/4N^2$ in the sum in 
Eq. (122) is 


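$$ \mu_{SN}\!\left(\frac{r_i^2 f_i^2}{4N^2}\right) = \int_0^\infty \frac{r_i^2 f_i^2}{4N^2}\, g_{SN}(r_i)\, dr_i = \int_0^\infty \frac{r_i^3 f_i^2}{4N^3} \exp\!\left[-\frac{r_i^2 + f_i^2}{2N}\right] I_0\!\left(\frac{r_i f_i}{N}\right) dr_i . \tag{124} $$

Similarly, 

$$ \mu_N\!\left(\frac{r_i^2 f_i^2}{4N^2}\right) = \int_0^\infty \frac{r_i^2 f_i^2}{4N^2}\, g_N(r_i)\, dr_i = \int_0^\infty \frac{r_i^3 f_i^2}{4N^3} \exp\!\left[-\frac{r_i^2}{2N}\right] dr_i . \tag{125} $$

Similarly, 

$$ \mu_{SN}\!\left(\frac{r_i^4 f_i^4}{16N^4}\right) = \int_0^\infty \frac{r_i^4 f_i^4}{16N^4}\, g_{SN}(r_i)\, dr_i = \int_0^\infty \frac{r_i^5 f_i^4}{16N^5} \exp\!\left[-\frac{r_i^2 + f_i^2}{2N}\right] I_0\!\left(\frac{r_i f_i}{N}\right) dr_i , $$

$$ \mu_N\!\left(\frac{r_i^4 f_i^4}{16N^4}\right) = \int_0^\infty \frac{r_i^4 f_i^4}{16N^4}\, g_N(r_i)\, dr_i = \int_0^\infty \frac{r_i^5 f_i^4}{16N^5} \exp\!\left[-\frac{r_i^2}{2N}\right] dr_i . $$

The integrals for the case of noise alone can be evaluated easily: 

$$ \mu_N\!\left(\frac{r_i^2 f_i^2}{4N^2}\right) = \frac{f_i^2}{2N}, \qquad \mu_N\!\left(\frac{r_i^4 f_i^4}{16N^4}\right) = \frac{f_i^4}{2N^2} . \tag{126} $$

The integrals for the case of signal plus noise can be evaluated in terms of the confluent 
hypergeometric function, which turns out for the cases above to reduce to a simple polynomial. The 
required formulas are collected in convenient form in Threshold Signals on page 174. The results are 

$$ \mu_{SN}\!\left(\frac{r_i^2 f_i^2}{4N^2}\right) = \frac{f_i^2}{2N}\left(1 + \frac{f_i^2}{2N}\right), $$

and 

$$ \mu_{SN}\!\left(\frac{r_i^4 f_i^4}{16N^4}\right) = \frac{f_i^4}{2N^2}\left(1 + \frac{f_i^2}{N} + \frac{f_i^4}{8N^2}\right). \tag{127} $$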


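Since 

$$ \sigma^2(z) = \mu(z^2) - [\mu(z)]^2 , \tag{128} $$

the variances of $r_i^2 f_i^2/4N^2$ are 

$$ \sigma_{SN}^2\!\left(\frac{r_i^2 f_i^2}{4N^2}\right) = \frac{f_i^4}{4N^2}\left(1 + \frac{f_i^2}{N}\right) $$

and 

$$ \sigma_N^2\!\left(\frac{r_i^2 f_i^2}{4N^2}\right) = \frac{f_i^4}{4N^2} . \tag{129} $$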
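The moments of a single term $r_i^2 f_i^2/4N^2$ can be checked by simulation. The sketch below is illustrative only and is not the authors' method: it draws the in-phase and quadrature components as independent Gaussian variables of variance N, places the signal amplitude f entirely in the in-phase component (permissible because the envelope does not depend on phase), and compares sample moments with the closed forms $(f^2/2N)(1 + f^2/2N)$ and $(f^4/4N^2)(1 + f^2/N)$ for signal plus noise, and $f^2/2N$ and $f^4/4N^2$ for noise alone. All names and parameter values are hypothetical.

```python
import math
import random


def simulate_term_moments(f, noise_power, signal_present, trials=100000, seed=1):
    """Sample mean and variance of r^2 f^2 / (4 N^2) for one sampling point."""
    rng = random.Random(seed)
    sigma = math.sqrt(noise_power)
    offset = f if signal_present else 0.0   # signal placed at zero phase
    total = 0.0
    total_sq = 0.0
    for _ in range(trials):
        x = rng.gauss(offset, sigma)
        y = rng.gauss(0.0, sigma)
        z = (x * x + y * y) * f * f / (4.0 * noise_power ** 2)
        total += z
        total_sq += z * z
    mean = total / trials
    return mean, total_sq / trials - mean * mean


def predicted_term_moments(f, noise_power, signal_present):
    """Closed-form mean and variance of the same term."""
    a = f * f / (2.0 * noise_power)         # f^2 / 2N
    if signal_present:
        return a * (1.0 + a), a * a * (1.0 + 2.0 * a)
    return a, a * a


# Hypothetical amplitude and noise power.
f, N = 1.0, 1.0
simulated_sn = simulate_term_moments(f, N, True)
simulated_n = simulate_term_moments(f, N, False)
```

With f = 1 and N = 1 the predicted values are (0.75, 0.5) with signal present and (0.5, 0.25) with noise alone.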


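For the sum of independent random variables, the mean is the sum of the means of the terms and the 
variance is the sum of the variances. Therefore the means of $\ln \ell(r)$ are 

$$ \mu_{SN}(\ln \ell(r)) \approx -\sum_{i=1}^{n/2} \frac{f_i^2}{2N} + \sum_{i=1}^{n/2} \frac{f_i^2}{2N}\left(1 + \frac{f_i^2}{2N}\right) = \sum_{i=1}^{n/2} \frac{f_i^4}{4N^2} $$

and 

$$ \mu_N(\ln \ell(r)) \approx -\sum_{i=1}^{n/2} \frac{f_i^2}{2N} + \sum_{i=1}^{n/2} \frac{f_i^2}{2N} = 0, \tag{130} $$

and the variances of $\ln \ell(r)$ are 

$$ \sigma_{SN}^2(\ln \ell(r)) \approx \sum_{i=1}^{n/2} \frac{f_i^4}{4N^2}\left(1 + \frac{f_i^2}{N}\right) $$

and 

$$ \sigma_N^2(\ln \ell(r)) \approx \sum_{i=1}^{n/2} \frac{f_i^4}{4N^2} . \tag{131} $$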


If the distribution functions of $\ln \ell(r)$ can be assumed to be normal, they can be obtained 
immediately from the mean and standard deviation of the logarithm of likelihood ratio. 

Let us consider the case in which the incoming signal is a rectangular pulse which is M/W seconds 
long.* The energy of the pulse is half its duration times the amplitude squared of its envelope, for a 
normalized circuit impedance of one ohm. 


* The problem of finding the distribution for the sum of M independent random variables, each with a 
probability density function $f(x) = x \exp[-(1/2)(x^2 + a^2)]\, I_0(ax)$, arises in the unpublished report 
by J. I. Marcum, A Statistical Theory of Target Detection by Pulsed Radar: Mathematical Appendix, 
Project Rand Report R-113. Marcum gives an exact expression for this distribution which is useful 
only for small values of M, and an approximation in Gram-Charlier series which is more accurate than 
the normal approximation given here. Marcum's expressions could be used in this case, and in the case 
presented in Section }.6. 


200 


Thus of the WT numbers $\{f_i\}$, there are M consecutive ones which are not zero. These are given by
$$f_i = \sqrt{\frac{2EW}{M}}, \tag{132}$$
where E is the pulse energy at point A in Fig. 5 in the absence of noise. For this case, Eq. (130) and
Eq. (131) become
$$\mu_{SN}(\ln \ell(r)) = \frac{E^2}{M N_0^2}, \qquad \mu_N(\ln \ell(r)) = 0,$$
$$\sigma_{SN}^2(\ln \ell(r)) = \frac{E^2}{M N_0^2}\left(1 + \frac{2E}{M N_0}\right),$$
and
$$\sigma_N^2(\ln \ell(r)) = \frac{E^2}{M N_0^2}. \tag{133}$$


The distribution of $\ln \ell(r)$ is approximately normal if M is much larger than one, for, by the
central limit theorem, the distribution of a sum of M independent random variables with a common
distribution must approach the normal distribution as M becomes large. The actual distribution for the
case of noise alone can be calculated in this case, since the convolution integral for the $g_N(r_i)$ with
itself any number of times can be expressed in closed form. The distribution of $\ln \ell(r)$ for signal
plus noise is more nearly normal than its distribution with noise alone, since the distributions
$g_{SN}(r_i)$ are more nearly normal than $g_N(r_i)$.

The receiver operating characteristic for the case M = 16 is plotted in Fig. 6 using the normal
distribution as approximation to the true distribution. In many cases it will be found that
$$\frac{2E}{M N_0} \ll 1. \tag{134}$$
In such a case the distributions have approximately the same variance. Assuming normal distribution
then leads to the curves of Figs. 2 and 3, with
$$d = \frac{1}{4M}\left(\frac{2E}{N_0}\right)^2. \tag{135}$$
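
The normal approximation of this section can be tried numerically. The following is only a sketch under stated assumptions: it takes the means and variances of Eq. (133) as given, treats the logarithm of the likelihood ratio as exactly normal, and uses illustrative function names (broadband_moments, detection_probability) that do not appear in the paper.

```python
from statistics import NormalDist


def broadband_moments(e_over_n0: float, m: int):
    """Return (mean_sn, mean_n, var_sn, var_n) of ln l(r) as given by Eq. (133)."""
    base = e_over_n0 ** 2 / m                      # E^2 / (M N0^2)
    var_sn = base * (1.0 + 2.0 * e_over_n0 / m)    # extra factor 1 + 2E/(M N0)
    return base, 0.0, var_sn, base


def detection_probability(e_over_n0: float, m: int, false_alarm: float) -> float:
    """Detection probability at a fixed false-alarm probability (normal approximation)."""
    mean_sn, mean_n, var_sn, var_n = broadband_moments(e_over_n0, m)
    noise = NormalDist(mean_n, var_n ** 0.5)
    signal = NormalDist(mean_sn, var_sn ** 0.5)
    threshold = noise.inv_cdf(1.0 - false_alarm)   # cut-off on ln l(r)
    return 1.0 - signal.cdf(threshold)
```

Sweeping false_alarm over (0, 1) for a fixed E/N0 and M = 16 traces a curve of the kind plotted in Fig. 6, within the limits of the normal assumption.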


4.8 A Radar Case 


This section deals with detecting a radar target at a given range. That is, we shall assume that
the signal, if it occurs, consists of a train of M pulses whose time of occurrence and envelope shape
are known. The carrier phase will be assumed to have a uniform distribution for each pulse independent
of all others, i.e., the pulses are incoherent.

The set of signals can be described as follows:
$$s(t) = \sum_{m=0}^{M-1} f(t + m\tau)\cos(\omega t + \theta_m), \tag{136}$$
where the M angles $\theta_m$ have independent uniform distributions, and the function f, which is the envelope
of a single pulse, has the property that


$$\int_0^T f(t + i\tau)\, f(t + j\tau)\, dt = \frac{2E}{M}\,\delta_{ij}, \tag{137}$$
where $\delta_{ij}$ is the Kronecker delta function, which is zero if i ≠ j, and unity if i = j. The time $\tau$ is
the interval between pulses. Eq. (137) states that the pulses are spaced far enough so that they are
orthogonal, and that the total signal energy is E.* The function f(t) is also assumed to have no
frequency components as high as $\omega/2\pi$.

[Fig. 6. Receiver operating characteristic. Broad band receiver with optimum video design, M = 16.
Abscissa: $100\,F_N(\ell)$; ordinate: $100\,F_{SN}(\ell)$.]


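The likelihood ratio can be obtained by applying Eq. (56). Then
$$\ell(x) = \int \exp\left[-\frac{E}{N_0}\right]\exp\left[\frac{2}{N_0}\int_0^T s(t)\,x(t)\,dt\right] dP_S(s) \tag{138}$$
or
$$\ell(x) = \exp\left[-\frac{E}{N_0}\right]\frac{1}{(2\pi)^M}\int_0^{2\pi}\!\!\cdots\!\int_0^{2\pi}\exp\left[\frac{2}{N_0}\int_0^T\sum_{m=0}^{M-1} f(t+m\tau)\,x(t)\cos(\omega t+\theta_m)\,dt\right] d\theta_0\cdots d\theta_{M-1}. \tag{139}$$
The integral can be evaluated, as in Section 4.5, yielding
$$\ell(x) = \exp\left[-\frac{E}{N_0}\right]\prod_{m=0}^{M-1} I_0\!\left(\frac{r_m}{N}\right), \tag{140}$$
where
$$\left(\frac{r_m}{N}\right)^2 = \left[\frac{2}{N_0}\int_0^T f(t+m\tau)\,x(t)\cos\omega t\,dt\right]^2 + \left[\frac{2}{N_0}\int_0^T f(t+m\tau)\,x(t)\sin\omega t\,dt\right]^2. \tag{141}$$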


This quantity $r_m$ is almost identical with the quantity r which appeared in the discussion of the
case of the signal known except for carrier phase, Section 4.5. In fact, each $r_m$ could be obtained in
a receiver in the manner described in that section. The quantity $r_0$ is connected with the first pulse;
it could be obtained by designing an ideal filter for the signal
$$s(t) = f(t)\cos(\omega t + \theta) \tag{142}$$
for any value of the phase angle $\theta$, and putting the output through a linear detector. The output will
be $(N_0/2)\,r_0/N$ at some instant of time $t_0$, which is determined by the time delay of the filter. The
other quantities $r_m$ differ only in that they are associated with the pulses which come later. The
output of the filter at time $t_0 + m\tau$ will be $(N_0/2)\,r_m/N$.

It is convenient to have the receiver calculate the logarithm of the likelihood ratio,
$$\ln \ell(x) = -\frac{E}{N_0} + \sum_{m=0}^{M-1}\ln I_0\!\left(\frac{r_m}{N}\right). \tag{143}$$


Thus the $\ln I_0(r_m/N)$ must be found for each $r_m$, and these M quantities must be added. As in the
previous section, $r_m/N$ will usually be small enough so that $\ln I_0(x)$ can be approximated by $x^2/4$. The
quantities $\tfrac{1}{4}(r_m/N)^2$ can be found by using a square law detector rather than a linear detector, and
the outputs of the square law detector at times $t_0,\ t_0+\tau,\ \ldots,\ t_0+(M-1)\tau$ then must be added. The
ideal system thus consists of an IF amplifier with its passband matched to a single pulse,** a
square law detector (for the threshold signal case), and an integrating device.

* The factor 2 appears in (137) because f(t) is the pulse envelope; the factor M appears because the
total energy E is M times the energy of a single pulse.

** It is usually most convenient to make the ideal filter (or an approximation to it) a part of the
IF amplifier.
We shall find normal approximations for the distribution functions of the logarithm of the
likelihood ratio using the approximation
$$\ln I_0\!\left(\frac{r_m}{N}\right) \approx \frac{r_m^2}{4N^2}, \tag{144}$$
which is valid for small values of $r_m/N$.* Substitution of (144) into (143) yields
$$\ln \ell(x) \approx -\frac{E}{N_0} + \frac{1}{4}\sum_{m=0}^{M-1}\left(\frac{r_m}{N}\right)^2. \tag{145}$$
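
How close the square-law form is to the exact logarithm of the likelihood ratio can be checked directly. The sketch below is illustrative only: the values of $r_m/N$ are assumed to be supplied by the receiver, $I_0$ is computed from its power series rather than from a library, and the function names are hypothetical.

```python
import math


def bessel_i0(x: float, terms: int = 40) -> float:
    """Modified Bessel function I0 from its power series sum((x^2/4)^k / (k!)^2)."""
    q = (x / 2.0) ** 2
    term, total = 1.0, 1.0
    for k in range(1, terms):
        term *= q / (k * k)
        total += term
    return total


def log_likelihood_exact(r_over_n, e_over_n0: float) -> float:
    """Eq. (143): -E/N0 plus the sum of ln I0(r_m/N) over the M pulses."""
    return -e_over_n0 + sum(math.log(bessel_i0(v)) for v in r_over_n)


def log_likelihood_square_law(r_over_n, e_over_n0: float) -> float:
    """Eq. (145): ln I0(x) replaced by x^2/4 (square law detector plus integrator)."""
    return -e_over_n0 + sum(v * v / 4.0 for v in r_over_n)
```

Since $I_0(x) \le \exp(x^2/4)$, the square-law value is never below the exact one, and the gap shrinks rapidly as the $r_m/N$ become small.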


The distributions for the quantities $r_m$ are independent; this follows from the fact that the individual
pulse functions $f(t+m\tau)\cos(\omega t+\theta_m)$ are orthogonal. The distribution for each is the same as the
distribution for the quantity r which appears in the discussion of the signal known except for phase;
the same analysis applies to both cases. Thus, by Eq. (83)**


aol a ae 
n\n v & = SFP (352 


(i all pstezet] (146) 
Py N S52 — exp |- OE d 
and by (89), 
ee) 
NM x 2} 
st Ee en se - a en 
ra fe Pe) a Z| f= =] (/r)* dans 
a 
or 
re) 
29 acN M 
ae (72 4 = mow |- ain | / 2 om -—e-] Ifa) da, 
a 


o,()= Se( on |- (2) ()] 


ee Nee |- Ean ; (say (2) wey (148) 


* See the footnote below equation (131). 


** The M appears in the following equations because the energy of a single pulse is E/M rather than E.


This is the same situation, mathematically, as appeared in the previous section. The standard
deviation and the mean for the logarithm of the likelihood ratio can be found in the same manner, and they
are
$$\mu_{SN}(\ln \ell) = \frac{E^2}{M N_0^2}, \qquad \mu_N(\ln \ell) = 0,$$
$$\sigma_{SN}^2(\ln \ell) = \frac{E^2}{M N_0^2}\left(1 + \frac{2E}{M N_0}\right),$$
and
$$\sigma_N^2(\ln \ell) = \frac{E^2}{M N_0^2}. \tag{149}$$


If the distributions can be assumed normal, they are completely determined by their means and
variances. These formulas are identical with the formulas (133) of the previous section. The problem
is the same, mathematically, and the discussion and receiver operating characteristic curves at the
end of Section 4.7 apply to both cases.


4.9 Approximate Evaluation of an Optimum Receiver 


In order to obtain approximate results for the remaining two cases, the assumption is made that
in these cases the receiver operating characteristic can be approximated by the curves of Figs. 2 and
3, i.e., that the logarithm of the likelihood ratio is approximately normal. This section discusses
the approximation and a method for fitting the receiver operating characteristic to the curves of
Figs. 2 and 3.

By (68), $F_{SN}(\ell)$ can be calculated if $F_N(\ell)$ is known. Furthermore, it can be seen that the nth
moment of the distribution $F_N(\ell)$ is the (n - 1)th moment of the distribution $F_{SN}(\ell)$. Hence, the mean
of the likelihood ratio with noise alone is unity, and if the variance of the likelihood ratio with
noise alone is $\sigma_N^2$, the second moment with noise alone, and hence the mean with signal plus noise, is
$1 + \sigma_N^2$. Thus the difference between the means is equal to $\sigma_N^2$, which is the variance of the
likelihood ratio with noise alone. Probably this number characterizes ability to detect signals better
than any other single number.

Suppose the logarithm of the likelihood ratio has a normal distribution with noise alone, i.e.,
$$F_N(\ell) = \frac{1}{\sqrt{2\pi d}}\int_{\ln \ell}^{\infty}\exp\left[-\frac{(x-m)^2}{2d}\right] dx, \tag{150}$$
where m is the mean and d the variance of the logarithm of the likelihood ratio. The nth moment of
the likelihood ratio can be found as follows:
$$\mu_N(\ell^n) = \int_0^{\infty}\ell^n\, dF_N(\ell) = \frac{1}{\sqrt{2\pi d}}\int_{-\infty}^{\infty}\exp[nx]\exp\left[-\frac{(x-m)^2}{2d}\right] dx, \tag{151}$$
where the substitution $\ell = \exp x$ has been made. The integral can be evaluated by completing the square
in the exponent and using the fact that
$$\int_{-\infty}^{\infty}\exp\left[-\frac{x^2}{2d}\right] dx = \sqrt{2\pi d}.$$
Thus
$$\mu_N(\ell^n) = \exp\left[\frac{n^2 d}{2} + nm\right]. \tag{152}$$
In particular, the mean of $\ell(x)$, which must be unity, is
$$\mu_N(\ell) = \exp\left[\frac{d}{2} + m\right] = 1, \tag{153}$$
and therefore
$$m = -\frac{d}{2}. \tag{154}$$
The variance of $\ell(x)$ with noise alone is $\sigma_N^2$, and therefore the second moment of $\ell(x)$ is
$$\mu_N(\ell^2) = [\mu_N(\ell)]^2 + \sigma_N^2(\ell) = 1 + \sigma_N^2(\ell), \tag{155}$$
and this must agree with (152). It follows that
$$\mu_N(\ell^2) = 1 + \sigma_N^2 = \exp[2d + 2m] = \exp[d], \tag{156}$$
and therefore
$$d = \ln(1 + \sigma_N^2). \tag{157}$$

The distribution of likelihood ratio with signal plus noise can be found by applying Eq. (68). Thus
$$dF_{SN}(\ell) = \ell\, dF_N(\ell), \qquad F_{SN}(\ell) = \int_{\ell}^{\infty}\ell\, dF_N(\ell). \tag{158}$$
If $dF_N(\ell)$ is obtained from Eq. (150) and $\ell$ is replaced by exp x, then
$$F_{SN}(\ell) = \frac{1}{\sqrt{2\pi d}}\int_{\ln \ell}^{\infty}\exp[x]\exp\left[-\frac{(x + d/2)^2}{2d}\right] dx$$
or
$$F_{SN}(\ell) = \frac{1}{\sqrt{2\pi d}}\int_{\ln \ell}^{\infty}\exp\left[-\frac{(x - d/2)^2}{2d}\right] dx. \tag{159}$$
Thus the distribution of $\ln \ell$ is normal also when there is signal plus noise, in this case with mean
d/2 and variance d.

In summary, it is probable that the variance $\sigma_N^2$ of the likelihood ratio measures ability to
detect signals better than any other single number. If the logarithm of likelihood ratio has a normal
distribution with noise alone, then this distribution and that with signal plus noise are completely
determined if $\sigma_N^2$ is given. The distribution of $\ln \ell(x)$ is normal in both cases. Its variance in
both cases is d, which is also the difference of the means. The receiver operating characteristic
curves are those plotted in Fig. 2, with the parameter d related to $\sigma_N^2$ by the equation
$$d = \ln(1 + \sigma_N^2). \tag{160}$$
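
The summary above amounts to a small recipe: take the variance of the likelihood ratio with noise alone, form d, and read the operating characteristic from two normal distributions of variance d with means -d/2 and +d/2. A sketch of that recipe follows; the function names are illustrative, and the normality of the logarithm of the likelihood ratio is the assumption of this section rather than something the code verifies.

```python
import math
from statistics import NormalDist


def detection_index(var_lr_noise: float) -> float:
    """d = ln(1 + sigma_N^2), as in Eq. (160)."""
    return math.log1p(var_lr_noise)


def moment_noise(n: int, d: float) -> float:
    """n-th moment of the likelihood ratio with noise alone, Eq. (152) with m = -d/2."""
    m = -d / 2.0
    return math.exp(n * n * d / 2.0 + n * m)


def roc_point(d: float, log_threshold: float):
    """(false-alarm, detection) probabilities for a cut-off on ln l."""
    sd = math.sqrt(d)
    p_false_alarm = 1.0 - NormalDist(-d / 2.0, sd).cdf(log_threshold)
    p_detection = 1.0 - NormalDist(d / 2.0, sd).cdf(log_threshold)
    return p_false_alarm, p_detection
```

The first moment returned by moment_noise is unity and the second is 1 + sigma_N^2, which is the consistency condition used in the derivation.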


In the case of a signal known exactly, this is the distribution which occurs. In the cases of
Section 4.6, Section 4.7, and Section 4.8 this distribution is found to be the limiting distribution
when the number of sample points is large. Certainly in most cases the distribution has this general
form. Thus it seems reasonable that useful approximate results could be obtained by calculating only
$\sigma_N^2$ for a given case and assuming that the ability to detect signals is approximately the same as if
the logarithm of the likelihood ratio had a normal distribution. On this basis, $\sigma_N^2(\ell)$ is calculated
in the following sections for two cases, and the assertion is made that the receiver operating
characteristic curves are approximated by those of Fig. 2 with $d = \ln(1 + \sigma_N^2)$.


4.10 Signal Which is One of M Orthogonal Signals


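Suppose that the set of expected signals includes just M functions $s_k(t)$, all of which have the
same probability, the same energy E, and are orthogonal. That is,
$$\int_0^T s_k(t)\, s_q(t)\, dt = E\,\delta_{kq}. \tag{161}$$
Then the likelihood ratio can be found from Eq. (56) to be
$$\ell(x) = \sum_{k=1}^{M}\frac{1}{M}\exp\left[-\frac{E}{N_0}\right]\exp\left[\frac{1}{N}\sum_{i=1}^{n} x_i s_{ik}\right]$$
or
$$\ell(x) = \frac{1}{M}\sum_{k=1}^{M}\exp\left[\frac{1}{N}\sum_{i=1}^{n} x_i s_{ik} - \frac{E}{N_0}\right], \tag{162}$$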

where $s_{ik}$ are the sample values of the function $s_k(t)$.

With noise alone, each term of the form $(1/N)\sum_i x_i s_{ik}$ has a normal distribution with mean zero
and variance $\sum_i s_{ik}^2/N = 2E/N_0$.* Furthermore, the M quantities $(1/N)\sum_i x_i s_{ik}$ are independent,
since the functions $s_k(t)$ are orthogonal. It follows that the terms $\exp[(1/N)\sum_i x_i s_{ik} - E/N_0]$ are
independent.

* The reasoning is the same as that in Section 4.4.

Since the logarithm of each term $z = \exp[(1/N)\sum_i x_i s_{ik} - E/N_0]$ has a normal distribution with
mean $(-E/N_0)$ and variance $2E/N_0$, the moments of the distribution can be found from Eq. (152). The nth
moment is
$$\mu_N(z^n) = \exp\left[n(n-1)\frac{E}{N_0}\right]. \tag{163}$$
It follows that the mean of each term is unity, and the variance is
$$\sigma_N^2(z) = \mu_N(z^2) - [\mu_N(z)]^2 = \exp\left[\frac{2E}{N_0}\right] - 1. \tag{164}$$
The variance of a sum of independent random variables is the sum of the variances of the terms.
Therefore
$$\sigma_N^2(M\ell) = M\left[\exp\left(\frac{2E}{N_0}\right) - 1\right], \tag{165}$$
and it follows that the variance of the likelihood ratio is
$$\sigma_N^2(\ell) = \frac{1}{M}\left[\exp\left(\frac{2E}{N_0}\right) - 1\right]. \tag{166}$$
It was pointed out in Section 4.9 that the receiver operating characteristic curves are
approximately those of Fig. 2, with
$$d = \ln(1 + \sigma_N^2) = \ln\left(1 - \frac{1}{M} + \frac{1}{M}\exp\left(\frac{2E}{N_0}\right)\right). \tag{167}$$
This equation can be solved for $2E/N_0$:
$$\frac{2E}{N_0} = \ln\left[1 + M\left(e^d - 1\right)\right]. \tag{168}$$
Suppose it is desired to keep the false alarm probability and probability of detection constant.
This requires that d be kept constant. Then from Eq. (168) it can be seen that if the number of possible
signals M is increased, the signal energy E must also be increased.
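
The dependence of the required energy on the number of signals is easy to tabulate from Eqs. (167) and (168). The following is a minimal sketch with illustrative function names; it simply evaluates the two formulas and is not part of the original analysis.

```python
import math


def detection_index_orthogonal(two_e_over_n0: float, m: int) -> float:
    """Eq. (167): d = ln(1 - 1/M + exp(2E/N0)/M)."""
    return math.log(1.0 - 1.0 / m + math.exp(two_e_over_n0) / m)


def required_energy_ratio(d: float, m: int) -> float:
    """Eq. (168): 2E/N0 = ln(1 + M (e^d - 1))."""
    return math.log(1.0 + m * math.expm1(d))
```

For a fixed d, calling required_energy_ratio with increasing M gives an increasing value of 2E/N0; with M = 1 it returns d itself, which is the case of the signal known exactly.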


4.11 Signal Which is One of M Orthogonal Signals with Unknown Carrier Phase

Consider the case in which the set of expected signals includes just M different amplitude-modulated
signals which are known except for carrier phase. Denote the signals by
$$s_k(t) = f_k(t)\cos(\omega t + \theta). \tag{169}$$
It will be assumed further that the functions $f_k(t)$ all have the same energy E and are orthogonal, i.e.,
$$\int_0^T f_k(t)\, f_q(t)\, dt = 2E\,\delta_{kq}, \tag{170}$$
where the 2 is introduced because the f's are the signal amplitudes, not the actual signal functions.
Also, let the $f_k(t)$ be band-limited to contain no frequencies as high as $\omega/2\pi$. Then it follows that any
two signal functions with different envelope functions will be orthogonal. Let us assume also that the
distribution of phase, $\theta$, is uniform, and that the probability for each envelope function is 1/M.

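With these assumptions, the likelihood ratio can be obtained from Eq. (66), and it is given by
$$\ell(x) = \sum_{k=1}^{M}\frac{1}{M}\exp\left[-\frac{E}{N_0}\right]\frac{1}{2\pi}\int_0^{2\pi}\exp\left[\frac{1}{N}\sum_{i=1}^{n} x_i s_{ik}\right] d\theta, \tag{171}$$
where $s_{ik}$ are the sample values of $s_k(t)$, and hence depend upon the phase $\theta$. The integration is the
same as in the case of the signal known except for phase, and the result, obtained from Eq. (82), is
$$\ell(x) = \frac{1}{M}\sum_{k=1}^{M}\exp\left[-\frac{E}{N_0}\right] I_0\!\left(\frac{r_k}{N}\right), \tag{172}$$
where
$$r_k^2 = \Big(\sum_i x_i f_k(t_i)\cos\omega t_i\Big)^2 + \Big(\sum_i x_i f_k(t_i)\sin\omega t_i\Big)^2. \tag{173}$$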


Now the problem is to find $\sigma_N^2(\ell)$. The variance of each term in the sum in Eq. (172) can be
found since the distribution function with noise alone can be found in Section 4.5. Since the $f_k(t)$ are
orthogonal, the distributions of the $r_k$ are independent, and the terms in the sum in Eq. (172) are
independent. Then the variance of the likelihood ratio, $\sigma_N^2(\ell)$, is the sum of the variances of the
terms, divided by $M^2$.

The distribution function for each term $\exp(-E/N_0)\, I_0(r_k/N)$ is given in Section 4.5 by Eqs. (84)
and (85). If $\beta$ is defined by the equation
$$\beta = \exp\left[-\frac{E}{N_0}\right] I_0\!\left(\sqrt{\frac{2E}{N_0}}\,\alpha\right), \tag{174}$$
then the distribution function in the presence of noise for each term in Eq. (172) is
$$F_N^{(k)}(\beta) = \exp\left[-\frac{\alpha^2}{2}\right]. \tag{175}$$
The mean value of each term is
$$\mu_N^{(k)}(\beta) = \int \beta\, dF_N^{(k)}(\beta) = \int_0^{\infty}\exp\left[-\frac{E}{N_0}\right] I_0\!\left(\sqrt{\frac{2E}{N_0}}\,\alpha\right)\alpha\exp\left[-\frac{\alpha^2}{2}\right] d\alpha. \tag{176}$$


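This can be evaluated as on page 174 of Threshold Signals⁵, and the result is that $\mu_N^{(k)}(\beta) = 1$.
The second moment of each term is
$$\mu_N^{(k)}(\beta^2) = \int \beta^2\, dF_N^{(k)}(\beta) = \int_0^{\infty}\exp\left[-\frac{2E}{N_0}\right]\left[I_0\!\left(\sqrt{\frac{2E}{N_0}}\,\alpha\right)\right]^2\alpha\exp\left[-\frac{\alpha^2}{2}\right] d\alpha. \tag{177}$$
The integral can be evaluated as in Appendix E of Part II of reference 17, and the result is
$$\mu_N^{(k)}(\beta^2) = I_0\!\left(\frac{2E}{N_0}\right). \tag{178}$$
The variance of each term in Eq. (172) is
$$\left[\sigma_N^{(k)}(\beta)\right]^2 = \mu_N^{(k)}(\beta^2) - \left[\mu_N^{(k)}(\beta)\right]^2 = I_0\!\left(\frac{2E}{N_0}\right) - 1. \tag{179}$$
It follows that the variance of $M\ell$ is
$$\sigma_N^2(M\ell) = M\left[I_0\!\left(\frac{2E}{N_0}\right) - 1\right], \tag{180}$$
and therefore
$$\sigma_N^2(\ell) = \frac{1}{M}\left[I_0\!\left(\frac{2E}{N_0}\right) - 1\right], \tag{181}$$
since the variance for the sum of independent random variables is the sum of the variances.
If the approximation described in Section 4.9 is used, the receiver operating characteristic
curves are approximately those of Fig. 2, with
$$d = \ln(1 + \sigma_N^2) = \ln\left(1 - \frac{1}{M} + \frac{1}{M} I_0\!\left(\frac{2E}{N_0}\right)\right). \tag{182}$$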
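
Unlike the known-phase case, Eq. (182) has no closed-form inverse, but it is monotone in $2E/N_0$ and can be inverted numerically. The sketch below does this by bisection; the series for $I_0$, the bracketing strategy, and the function names are all choices made for illustration, not part of the paper.

```python
import math


def bessel_i0(x: float) -> float:
    """Modified Bessel function I0 from its power series, summed to convergence."""
    q = (x / 2.0) ** 2
    term = total = 1.0
    k = 0
    while term > 1e-17 * total:
        k += 1
        term *= q / (k * k)
        total += term
    return total


def detection_index_unknown_phase(two_e_over_n0: float, m: int) -> float:
    """Eq. (182): d = ln(1 - 1/M + I0(2E/N0)/M)."""
    return math.log(1.0 - 1.0 / m + bessel_i0(two_e_over_n0) / m)


def required_energy_ratio_unknown_phase(d: float, m: int) -> float:
    """Solve Eq. (182) for 2E/N0 by bisection (d > 0 assumed)."""
    lo, hi = 0.0, 1.0
    while detection_index_unknown_phase(hi, m) < d:   # grow the bracket
        hi *= 2.0
    for _ in range(100):
        mid = 0.5 * (lo + hi)
        if detection_index_unknown_phase(mid, m) < d:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)
```

Because $I_0(x) < e^x$ for x > 0, the value obtained here always exceeds the known-phase requirement $\ln[1 + M(e^d - 1)]$ for the same d and M, which is the cost of the unknown carrier phase under this approximation.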


4.12 The Broad Band Receiver and the Optimum Receiver


A few applications of the results of Section 4 are suggested in Table I, Section 4.1. Two further
examples of practical knowledge obtainable from the theory are presented in this section and in the
next.


One common method of detecting pulse signals in a frequency band of width B is to build a receiver
which covers this entire frequency band. Such a receiver with a pulse signal of known starting time
is studied in Section 4.7. This is not a truly optimum receiver; it would be interesting to compare it
with an optimum receiver. We have been unable to find the distribution of likelihood ratio for the
case of a signal which is a pulse of unknown carrier phase if the frequency is distributed evenly
over a band. However, if the problem is changed slightly, so that the frequency is restricted to
points spaced approximately the reciprocal of the pulse width apart, then pulses at different
frequencies are approximately orthogonal, and the case of the signal which is one of M orthogonal signals
known except for phase can be applied. Eq. (182) should be used with M equal to the ratio of the
frequency band width B to the pulse band width. Since the band width of a pulse is approximately the
reciprocal of its pulse width, the parameter M used in Section 4.7 also has this value. Curves showing
$2E/N_0$ as a function of d are given in Fig. 7 for both the approximate optimum receiver and the broad
band receiver for several values of M. In the figure, d is calculated from Eq. (135) and Eq. (182),
which hold for large values of M.

[Fig. 7. Comparison of optimum and broad band receivers.]


4.13 Uncertainty and Signal Detectability 


In the two cases where the signal considered is one of M orthogonal signals, the uncertainty of
the signal is a function of M. This provides an opportunity to study the effect of uncertainty on
signal detectability. In the approximate evaluation of the optimum receiver when the signal is one
of M orthogonal functions, the ROC curves of Figs. 2 and 3 are used with the detection index d given by
$$d = \ln\left[1 - \frac{1}{M} + \frac{1}{M}\exp\left(\frac{2E}{N_0}\right)\right]. \tag{167}$$
This equation can be solved for the signal energy, yielding
$$\frac{2E}{N_0} = \ln\left[1 - M + M e^d\right] \approx \ln M + \ln\left(e^d - 1\right), \tag{175}$$
the approximation holding for large $2E/N_0$.* From this equation it can be seen that the signal energy
is approximately a linear function of ln M when the detection index d, and hence the ability to
detect signals, is kept constant. It might be suspected that $2E/N_0$ is a linear function of the entropy,
$-\sum_i p_i \ln p_i$, where $p_i$ is the probability of the ith orthogonal signal. The linear relation holds only
when all the $p_i$ are equal. The expression which occurs in this more general case is:


$$\frac{2E}{N_0} \approx -\ln\Big(\sum_i p_i^2\Big) + \ln\left(e^d - 1\right). \tag{176}$$
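
The quality of the large-energy approximation $2E/N_0 \approx \ln M + \ln(e^d - 1)$, and the linearity in ln M that it implies, can be examined numerically. The sketch below compares it with the exact solution $\ln[1 - M + M e^d]$ for the equal-probability case; the function names are illustrative and d > 0 is assumed.

```python
import math


def exact_energy_ratio(d: float, m: int) -> float:
    """Exact solution: 2E/N0 = ln(1 - M + M e^d)."""
    return math.log(1.0 + m * math.expm1(d))


def approx_energy_ratio(d: float, m: int) -> float:
    """Large-energy approximation: ln M + ln(e^d - 1)."""
    return math.log(m) + math.log(math.expm1(d))


def relative_error(d: float, m: int) -> float:
    """Fractional shortfall of the approximation relative to the exact value."""
    exact = exact_energy_ratio(d, m)
    return (exact - approx_energy_ratio(d, m)) / exact
```

In the approximate form, multiplying M by a constant factor adds a constant to $2E/N_0$ regardless of d, which is the linear dependence on ln M described in the text.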


* If $2E/N_0 > 3$, the error is less than 10%.


LIST OF REFERENCES


1. S. Goldman, Information Theory, Prentice-Hall, New York, 1953. Chapter II, pp. 65-84, is devoted
to sampling plans.

2. C. E. Shannon, "Communication in the Presence of Noise," Proc. I.R.E., Vol. 37, pp. 10-21,
January, 1949.

3. U. Grenander, "Stochastic Processes and Statistical Inference," Arkiv för Matematik, Bd 1 nr 17,
p. 195, 1950.

4. J. Neyman and E. S. Pearson, "On the Problems of the Most Efficient Tests of Statistical
Hypotheses," Philosophical Transactions of the Royal Society of London, Vol. 231, Series A,
p. 289, 1933.

5. J. L. Lawson and G. E. Uhlenbeck, Threshold Signals, McGraw-Hill, New York, 1950.

6. P. M. Woodward and I. L. Davies, "Information Theory and Inverse Probability in
Telecommunications," Proc. I.E.E. (London), Vol. 99, Part III, pp. 37-44, March, 1952.

7. I. L. Davies, "On Determining the Presence of Signals in Noise," Proc. I.E.E. (London), Vol. 99,
Part III, pp. 45-51, March, 1952.

8. A. Wald, Sequential Analysis, John Wiley and Sons, 1947.

9. W. C. Fox, "Signal Detectability: A Unified Description of Statistical Methods Employing Fixed and
Sequential Observation Processes," Electronic Defense Group, University of Michigan, Technical
Report No. 19 (unclassified).

10. A. Wald and J. Wolfowitz, "Optimum Character of the Sequential Probability Ratio Test," Ann.
Math. Stat., Vol. 19, p. 326, September, 1948.

11. E. Reich and P. Swerling, "The Detection of a Sine Wave in Gaussian Noise," Journal Applied
Physics, Vol. 24, p. 289, March, 1953.

12. R. C. Davis, "On the Detection of Sure Signals in Noise," Journal Applied Physics, Vol. 25,
pp. 76-82, January, 1954.

13. J. V. Harrington and T. F. Rogers, "Signal-to-Noise Improvement Through Integration in a
Storage Tube," Proc. I.R.E., Vol. 38, p. 1197, October, 1950.
A. E. Harting and J. E. Meade, "A Device for Computing Correlation Functions," Rev. Sci. Instr.,
Vol. 23, 375, 1952.
Y. W. Lee, T. P. Cheatham, Jr., and J. B. Wiesner, "Applications of Correlation Analysis to the
Detection of Periodic Signals in Noise," Proc. I.R.E., Vol. 38, p. 1165, October, 1950.
M. J. Levin and J. F. Reintjes, "A Five Channel Electronic Analog Correlator," Proc. Nat. El.
Conf., Vol. 8, 1952.

14. D. O. North, "An Analysis of the Factors which Determine Signal-Noise Discrimination in Pulsed
Carrier Systems," RCA Laboratory Rpt PTR-6C, 1943.
See also Reference 5, p. 206.

15. Graphs of values of the integral (89) along with approximate expressions for small and for large
values appear in S. O. Rice, "Mathematical Analysis of Random Noise," B.S.T.J., Vol. 23,
pp. 282-332 and Vol. 24, pp. 46-156, 1944-5. Tables of this function have been compiled by J. I.
Marcum in an unpublished report of the Rand Corporation, "Table of Q-Functions," Project Rand
Report RM-399.

16. P. G. Hoel, Introduction to Mathematical Statistics, New York: Wiley, 1947, p. 246.

17. The material of Sections 2 and 3 of this paper is drawn from reference 9 above and from Part I of
W. W. Peterson and T. G. Birdsall, "The Theory of Signal Detectability," Electronic Defense
Group, University of Michigan, Technical Report No. 13 (Unclassified), July, 1953. Part II of
that report contains the material in Section 4 of this paper. Other work in this field may be
found in D. Middleton, "Statistical Criteria for the Detection of Pulsed Carriers in Noise,"
Jour. App. Phys., Vol. 24, p. 371, April, 1953; D. Middleton, "The Statistical Theory of
Detection, I: Optimum Detection of Signals in Noise," M.I.T. Lincoln Laboratory, Technical Report No.
35, November 2, 1953; D. Middleton, "Statistical Theory of Signal Detection," Trans. I.R.E.,
PGIT-3, p. 26, March, 1954; D. Middleton, W. W. Peterson, and T. G. Birdsall, "Discussion of
'Statistical Criteria for the Detection of Pulsed Carriers in Noise. I, II,'" Journal Applied
Physics, Vol. 25, pp. 128-130, January, 1954.


THE HUMAN USE OF INFORMATION 
I. SIGNAL DETECTION FOR THE CASE OF THE SIGNAL KNOWN EXACTLY 


Wilson P. Tanner Jr. and John A. Swets 


University of Michigan 


Abstract 


A theory of visual detection is developed, based on the model provided by the theory of signal
detectability,² and, more generally, by the theory of statistical decision. Two experiments are
reported which test some predictions of the theory for the case of the signal-known-exactly. These
experiments demonstrate that the human observer tends toward optimum behavior, where optimum behavior
is defined as that behavior which maximizes the expected gain from the decision. Their results show
the proportion of correct detections to be dependent upon the proportion of false alarms; they
indicate that neural activity is a power function of signal intensity. The data also demand a
re-evaluation of the threshold concept. Predictions are made for the data obtained using two different
methods of response, forced-choice and yes-no, and the internal consistency of the theory is
demonstrated. The predictions of the theory are compared with contrasting predictions of conventional
sensory theory; the data are also related to conventional theory.


Introduction 


There is some indication that the theory of statistical decision, or in particular, the theory 
of statistical inference, constitutes a model of relevance to several aspects of human behavior. 
When a set of rather reasonable assumptions about neurophysiology is coupled with the assumption that 
the organism tends toward optimum behavior, the theory of statistical decision permits the specifica- 
tion of behavior in a variety of situations that submit to experimental manipulation. In this paper, 
experiments are reported which were designed to test the predictions that follow from this model for 
the behavior of the human observer in a visual detection situation. Since several of these predic- 
tions are in conflict with predictions of conventional sensory theory, the conventional theory is 
reviewed in the next section. 


Conventional Sensory Theory 


For the present purposes, the most significant aspect of conventional sensory theory is its
implication that so-called sensory phenomena are peripherally determined. This point of view is
directly related to the concept of a threshold, the notion that if some fixed amount of neural activity in
the sensory system is exceeded, a signal is detected by the observer with a probability of unity. In
this framework, the observer's decision concerning the existence of a signal is assumed to depend
entirely upon whether the threshold is exceeded, and the threshold is assumed to be independent of
control by essentially non-sensory variables which might influence the attitude or set of the observer.
It is also assumed that the threshold is high enough to be exceeded very rarely by sensory system
activity unrelated to the presence of a physical signal. If this view is not explicit, it is, at
least, implied by the usual treatment of the data.


The primary data from visual detection experiments are frequencies of detection as a function of
the intensity of the light signal. In Fig. 1, the dotted lines represent the form of the results of
a hypothetical experiment. Consider first a single dotted line. Any point on the line might represent
an experimentally determined point. Conventionally, this point is corrected for chance successes by
application of the formula,
$$p = \frac{p' - c}{1 - c}, \tag{1}$$
where p' is the observed proportion of positive responses, p is the corrected proportion of positive
responses, and c is the intercept of the dotted line at zero signal intensity.

This paper is based on work done for the U. S. Army Signal Corps under Contract No. DA-36-039
sc-15358. The experiments reported herein have been reported previously in a technical report of
the Electronic Defense Group of the University of Michigan. These experiments were performed in the
Vision Research Laboratory of the University of Michigan.

The justification for the use of this chance-correction formula depends upon the validity of the
assumption of an independence of the events comprising the p and c terms, or the assumption that a
"false alarm" is a guess, independent of neural activity in the sensory system relevant to the
decision concerning signal existence. This assumption implies, and is implied by, the assumption of a
threshold. In this context the solid curve of Fig. 1, the curve onto which each of the dotted curves
can be mapped by application of the chance-correction formula, is regarded as a "true" curve; that is,
its parameters are assumed to be characteristic of the observer's sensory system.
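
The chance-correction formula, Eq. (1), is simple enough to state as a function. The sketch below is illustrative; the name correct_for_chance and the range check on c are additions, not part of conventional practice as described here.

```python
def correct_for_chance(p_observed: float, c: float) -> float:
    """Eq. (1): corrected proportion p = (p' - c) / (1 - c).

    p_observed is the observed proportion of positive responses (p'), and
    c is the proportion of positive responses at zero signal intensity.
    """
    if not 0.0 <= c < 1.0:
        raise ValueError("c must lie in [0, 1)")
    return (p_observed - c) / (1.0 - c)
```

By construction the formula maps an observed proportion equal to c onto zero and an observed proportion of unity onto unity, which is how each dotted curve is mapped onto the single "true" curve.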


Statistical Decision as a Model for a Theory of Visual Detection 


The Basis for Considering this Model 


The relevance of the theory of statistical decision as a model for visual detection is suggested 
by the very likely assumption that spontaneous neural activity occurs in the human's sensory system. 
Although direct observation of this activity has been made entirely on infra-human organisms, the 
data strongly suggest that extrapolation to humans is reasonable. Now, if the problem of detection 
is the detection of signals (which presumably have randomly distributed neural effects) in the pre- 
sence of random interference or noise, then the task of the observer is that of testing statistical 
hypotheses, and the model provided by the theory of statistical decision should aid in describing 
his behavior. This point of view suggests replacing the concept of a threshold by a concept of a 
criterion range of acceptance, the extent of which is controlled by the observer in the interests of 
optimum behavior. In addition, it suggests considering the probability that noise (spontaneous neural 
activity) alone may reach levels which will be in the criterion of acceptance. Also, in contrast with 
conventional theory, a dependence is implied between the conditional probability that neural activity 
when a signal exists is in the criterion, and the conditional probability that neural activity when 
no signal is present is in the criterion. 


Elaboration of the Theory 


In this and subsequent sections, a new theory of visual detection is developed, based on the model
constituted by the theory of statistical decision. A chronologically intermediate step between the
theory of statistical decision and the sensory theory presented here is the theory of signal
detectability, developed for theoretical observers, by Peterson, Birdsall, and Fox.² The mathematical
developments and symbols used below are those of Peterson, Birdsall, and Fox, unless otherwise stated.


The Form and Treatment of Sensory Information. It is supposed that the information relevant to
detection is a display of neural activity at the cortical level. In the case under consideration in
which a signal is presented at a specified time in a specified spatial location, it is assumed that the
observer will place the same restrictions on the relevant display. Thus, if the observer is asked to
state whether a signal exists in location A at time B, he is assumed to consider only that information
in the neural display which refers to location A at time B.

A judgement concerning the existence of a signal is presumably based upon some measure, x, of
neural activity. It is assumed that there exists a statistical relationship between the measure and
signal intensity. That is, the more intense the signal, the greater is the average of the measures
resulting. Thus, for any signal there is a universe distribution which is, in fact, a sampling
distribution. It includes all measures which might result if the signal were repeated and measured an
infinite number of times. The mean of this universe distribution is associated with the intensity
level of the signal. The variance may be associated with other parameters of the signal such as
duration or size, but this is beyond the scope of this paper.

Fig. 2 shows two distributions: N representing the case where noise alone is sampled, that is,
no signal exists, and S+N, the case where signal plus noise exists. The N and S+N distributions are
assumed to be probability density functions; thus the ordinate is probability density. The mean of N
depends upon the constant, prevailing background intensity; the mean of S+N depends on
background-plus-signal intensity. The variance of N depends on signal parameters, not background parameters,
in the case considered here; that is, where the observer knows a priori that if a signal exists,
then it will be a particular signal. From the way the diagram is conceptualized, the greater the
measure x, the more likely it is that this sample represents a signal. But one can never be certain.
Thus, if an observer is asked if a signal exists, he is assumed to base his judgment on the quantity
of neural activity. He makes an observation, and then attempts to decide whether this observation is
more representative of N or S+N. His task is, then, the task of testing a statistical hypothesis.

For mathematical convenience, it is assumed that the distributions shown in Fig. 2 are Gaussian,
with variances equal for N and all values of S+N. Experimental results suggest that equal variance
is not a true assumption, but the deviations are not so great that the inconvenience of a more precise
assumption is justified for the purpose of this analysis. It is also assumed that there is a cut-off
point such that any measure of neural activity which exceeds that cut-off is in the criterion; that is,
any value exceeding the cut-off is accepted as representing the existence of a signal, and any value
less than the cut-off is regarded as representing noise alone. Again, for mathematical convenience,
the cut-off point is assumed to be well-defined and stable.

Now, consider the way in which the placing of the cut-off affects behavior in the case of a given
signal. In the lower right-hand corner of Figure 3, the distributions N and S+N are reproduced for a
value of d' = 1. d' is the difference between the means of N and S+N in terms of the standard
deviation of N. The criterion scale is also calibrated in terms of the standard deviation of N. On the
abscissa there is P_N(A), the probability that if no signal exists the measure will be in the criterion,
and on the ordinate P_SN(A), the probability that if a signal exists the measure will be in the
criterion. A, in this terminology, symbolizes "acceptance of the hypothesis that a signal exists."

If the cut-off is at -∞, all measures are in the criterion: P_N(A) = P_SN(A) = 1. At minus one
standard deviation P_N(A) = .84, P_SN(A) = .98. At zero, P_N(A) = .5, P_SN(A) = .84. At plus one P_N(A) =
.16 and P_SN(A) = .5, and for plus ∞, P_N(A) = P_SN(A) = 0. Thus, for d' = 1, the curve of Figure 3 shows
the probability of detection attainable at each false-alarm rate.
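
The tabulated points can be checked numerically. The following is a minimal sketch, assuming N is a unit-variance Gaussian centered at zero and S+N is the same Gaussian shifted by d'; the function name `roc_point` is illustrative only and does not appear in the original work.

```python
from statistics import NormalDist


def roc_point(cutoff, d_prime):
    """Return (P_N(A), P_SN(A)) for a cut-off on the criterion scale.

    Assumption: N ~ Normal(0, 1) and S+N ~ Normal(d_prime, 1), so the
    cut-off is measured from the mean of N in standard deviations of N.
    """
    noise = NormalDist(0.0, 1.0)
    signal_plus_noise = NormalDist(d_prime, 1.0)
    # Probability that the measure exceeds the cut-off under each hypothesis.
    p_false_alarm = 1.0 - noise.cdf(cutoff)
    p_detection = 1.0 - signal_plus_noise.cdf(cutoff)
    return p_false_alarm, p_detection


if __name__ == "__main__":
    for cutoff in (-1.0, 0.0, 1.0):
        p_n, p_sn = roc_point(cutoff, d_prime=1.0)
        print(f"cut-off {cutoff:+.0f}: P_N(A) = {p_n:.2f}, P_SN(A) = {p_sn:.2f}")
```

Sweeping the cut-off over a fine grid traces the whole curve for a given d'.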


The Optimization Assumption. At this point, it is necessary to make an assumption which will permit
specification of the behavior expected of the observer. In conventional theory, the assumption of
a fixed threshold has made it possible to derive testable predictions. In the theory presented here,
where it is assumed that the position of the cut-off point between acceptance and rejection of the
existence of a signal is controlled by the observer, it is necessary to define the method of control
exerted by the observer on the cut-off point in order to make predictions. As stated above, it is
assumed that the observer's behavior tends toward optimum behavior. More specifically, it is assumed
that the observer sets the cut-off point at a position that maximizes the expected gain. That is to
say, the level of the cut-off is determined so as to maximize an expected payoff in terms of the values
of hits and correct rejections, and the costs of false alarms and misses.

Peterson, Birdsall, and Fox² have shown that the optimum behavior (in this case, that behavior
resulting in maximizing the expected gain) in any given experimental condition may be represented by
a point on the curve of Figure 3 where its slope is w, where

    w = \frac{1 - P(SN)}{P(SN)} \cdot \frac{V_{N \cdot CA} + K_{N \cdot A}}{V_{SN \cdot A} + K_{SN \cdot CA}},    (2)

where P(SN) is the a priori probability of signal occurrence, V_{N·CA} is the value of a correct rejection,
K_{N·A} is the cost of a false alarm, V_{SN·A} is the value of a correct detection, and K_{SN·CA} is the cost of
a miss.
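
As an illustration of Eq. (2), the sketch below computes w from an a priori probability and a payoff matrix. The function and parameter names are hypothetical labels for the quantities defined in the text, and the payoff figures in the usage lines are made up for the example; they are not the payoffs used in the experiments.

```python
def critical_likelihood_ratio(p_signal, v_correct_rejection, k_false_alarm,
                              v_detection, k_miss):
    """Evaluate Eq. (2): the slope w that defines the optimum criterion.

    p_signal            -- P(SN), a priori probability of a signal
    v_correct_rejection -- V_{N.CA}, value of a correct rejection
    k_false_alarm       -- K_{N.A}, cost of a false alarm
    v_detection         -- V_{SN.A}, value of a correct detection
    k_miss              -- K_{SN.CA}, cost of a miss
    """
    if not 0.0 < p_signal < 1.0:
        raise ValueError("P(SN) must lie strictly between 0 and 1")
    prior_odds_against_signal = (1.0 - p_signal) / p_signal
    payoff_ratio = (v_correct_rejection + k_false_alarm) / (v_detection + k_miss)
    return prior_odds_against_signal * payoff_ratio


if __name__ == "__main__":
    # Symmetric (hypothetical) payoffs: w depends on the prior alone.
    print(critical_likelihood_ratio(0.8, 1.0, 1.0, 1.0, 1.0))
    print(critical_likelihood_ratio(0.4, 1.0, 1.0, 1.0, 1.0))
```

With symmetric payoffs, raising P(SN) lowers w, which corresponds to a more lenient criterion and a higher false-alarm rate.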

Equation (2) can be derived from the expression for the expected value of a decision,

    EV = V_{SN \cdot A} P(SN \cdot A) + V_{N \cdot CA} P(N \cdot CA)
         - K_{SN \cdot CA} P(SN \cdot CA) - K_{N \cdot A} P(N \cdot A),    (3)

by substituting conditional probabilities for the probabilities of joint occurrence; e.g., P(SN) P_SN(A)
for P(SN·A). Then, maximizing EV is equivalent to requiring that P_SN(A) - w P_N(A) be a maximum. The
value of w thus defines the optimum criterion. More precisely, the optimum criterion consists of all
measures of neural activity with likelihood ratio greater than w; i.e., w is the critical value of the
likelihood ratio, where the likelihood ratio for a particular measure is defined as f_SN(x)/f_N(x), the ratio
of the probability density for that measure if there is signal plus noise to the probability density
if there is noise alone. It can be seen from Eq. (2) that as P(SN) or V_{SN·A} increases or K_{N·A} decreases,
w becomes smaller and it is worthwhile to accept a higher false alarm rate in the interest of achieving
a greater percentage of correct decisions.
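
Under the equal-variance Gaussian assumption adopted earlier, the critical likelihood ratio w can be translated into a cut-off on the criterion scale. The sketch below assumes N ~ Normal(0, 1) and S+N ~ Normal(d', 1), for which the likelihood ratio is exp(d'x - d'^2/2); setting this equal to w gives x = ln(w)/d' + d'/2. The function names are illustrative, not taken from the source.

```python
import math
from statistics import NormalDist


def optimal_cutoff(w, d_prime):
    """Cut-off x at which the likelihood ratio f_SN(x)/f_N(x) equals w.

    Assumption: N ~ Normal(0, 1), S+N ~ Normal(d_prime, 1), d_prime > 0.
    """
    return math.log(w) / d_prime + d_prime / 2.0


def objective(cutoff, w, d_prime):
    """The quantity P_SN(A) - w * P_N(A) that the optimum criterion maximizes."""
    std = NormalDist()
    p_detection = 1.0 - std.cdf(cutoff - d_prime)
    p_false_alarm = 1.0 - std.cdf(cutoff)
    return p_detection - w * p_false_alarm


if __name__ == "__main__":
    w, d_prime = 0.25, 1.0
    best = optimal_cutoff(w, d_prime)
    # The objective should be lower on either side of the computed cut-off.
    for x in (best - 0.5, best, best + 0.5):
        print(f"cut-off {x:+.3f}: objective {objective(x, w, d_prime):.4f}")
```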


The Predicted Form of the Data. Figure 4 shows a family of curves of P_SN(A) vs. P_N(A) with d' as
the parameter. This is to be compared with the predictions of conventional theory shown in Figure 5,
with P_N(A) assumed to represent guesses, or spurious responses unrelated to relevant neural activity.
For each value of signal intensity, it is assumed that there is a true value of P_SN(A) either for
P_N(A) = 0 or for some very small value. The chance correction should transform each of these to
horizontal lines.

Another way of comparing the predictions of this theory with those of conventional theory is to
construct curves showing the predicted shape of the psychophysical function. These curves are shown
in Fig. 6, where P(A), the probability of acceptance, is plotted as a function of d', for comparison
with the curves of Fig. 1. These curves will not correct into the same curve by the application of the
chance correction. The shift is horizontal rather than vertical. The dotted portions of the curve
show that we are dealing with only a part of the curve, and thus, in terms of this theory, it is
improper to apply a normalizing procedure such as the chance-correction formula to that part of the
curve.


The Forced-Choice Method of Response 


The preceding discussion specifies the behavior expected of the observer, in terms both of the
theory presented here and conventional theory, when the so-called yes-no method of response is employed. 
A second method of response has been used in psychophysical experimentation; this method is known as 
the forced-choice method. In this method, the observer does not report directly on the existence of a 
signal but is required to indicate detection by correctly identifying some attribute of the signal. 

In the specific version of the forced-choice method most commonly used, the observer knows that on each 
trial the signal will occur in one of four short, adjoining time intervals, and he is forced to choose 
in which of these intervals he believes the signal occurred. 


The Predicted Form of Forced-Choice Data. While conventional theory predicts the same form for
data collected under yes-no and forced-choice methods, the theory presented here leads to different
predictions for the form of the data collected using the two procedures. The predictions stemming
from this theory for forced-choice data are, as in the case of yes-no data, based on the assumptions
that the observer works with a continuous variable, the measure of neural activity or likelihood ratio,
and behaves optimally in terms of available information. Optimal behavior requires that the observer
select the interval with the greatest associated value of likelihood ratio. Then the probability P(C)
that a correct answer will result for a given value of d', for the four-choice or four-interval
situation, is the probability that the one sample from the S+N distribution is greater than the greatest of
three samples from the distribution of N. For the four-choice situation,

    P(C) = \int_{-\infty}^{+\infty} [F(x)]^3 g(x) \, dx,    (4)

where F(x) is the area of N below x and g(x) is the ordinate of S+N at x. In Fig. 7, P(C), as determined
by the integration, is plotted as a function of d', under the assumption of equal variance of the N and
S+N distributions.
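
The integral of Eq. (4) has no simple closed form but is easy to evaluate numerically. The following is a rough sketch using the trapezoidal rule, assuming N ~ Normal(0, 1) and S+N ~ Normal(d', 1); the function name, integration limits, and step count are choices made for this illustration only.

```python
from statistics import NormalDist


def p_correct_forced_choice(d_prime, n_intervals=4, lo=-8.0, hi=12.0, steps=4000):
    """Trapezoidal estimate of the integral of F(x)^(n-1) g(x) over x.

    F is the cumulative distribution of N and g the density of S+N.
    With n_intervals=4 this is Eq. (4); the limits lo and hi are assumed
    wide enough for d_prime up to about 6.
    """
    noise = NormalDist(0.0, 1.0)
    signal_plus_noise = NormalDist(d_prime, 1.0)
    h = (hi - lo) / steps
    total = 0.0
    for i in range(steps + 1):
        x = lo + i * h
        weight = 0.5 if i in (0, steps) else 1.0
        total += weight * noise.cdf(x) ** (n_intervals - 1) * signal_plus_noise.pdf(x)
    return total * h


if __name__ == "__main__":
    for d in (0.0, 0.5, 1.0, 2.0, 3.0):
        print(f"d' = {d:.1f}: P(C) = {p_correct_forced_choice(d):.3f}")
```

At d' = 0 the estimate reduces to chance performance, one in four, as it should.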


Criterion of Internal Consistency 


Since the theory predicts a different form of data for the two response procedures, forced-choice 
and yes-no, and since the predictions for the two situations are based on the same neurological para- 
meters, the existence of an internal consistency check on the theory is implied. The information on 
which the observer bases his decision is contained in the same neural display in the forced-choice sit- 
uation as in the yes-no situation, and presumably, the values of d' obtained from the two procedures 
for any given signal intensity must be the same. Thus, if the values of d' are estimated from the data
obtained when one of these methods is employed, these estimates should furnish a basis for predicting 
the data obtained using the other method if the theory is internally consistent. Or, equivalently, the 
criterion of internal consistency is satisfied if both sets of data yield the same estimates of d'. 
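
The consistency check can be sketched as two independent estimators of d'. The code below is an illustration under the equal-variance Gaussian assumption, not the graphical procedure used in the experiment: the yes-no estimate is taken as the difference of the normal deviates of P_SN(A) and P_N(A), and the forced-choice estimate inverts the four-interval P(C) integral by bisection. All names are hypothetical.

```python
from statistics import NormalDist

STD = NormalDist()


def d_prime_yes_no(p_detection, p_false_alarm):
    """Estimate d' from one point (P_N(A), P_SN(A)), equal-variance Gaussian case."""
    return STD.inv_cdf(p_detection) - STD.inv_cdf(p_false_alarm)


def p_correct(d_prime, n_intervals=4, lo=-8.0, hi=12.0, steps=2000):
    """Trapezoidal estimate of the forced-choice P(C) for a given d'."""
    h = (hi - lo) / steps
    total = 0.0
    for i in range(steps + 1):
        x = lo + i * h
        weight = 0.5 if i in (0, steps) else 1.0
        total += weight * STD.cdf(x) ** (n_intervals - 1) * STD.pdf(x - d_prime)
    return total * h


def d_prime_forced_choice(p_c, n_intervals=4, tol=1e-6):
    """Invert p_correct by bisection; assumes chance < p_c < p_correct(6)."""
    low, high = 0.0, 6.0
    while high - low > tol:
        mid = (low + high) / 2.0
        if p_correct(mid, n_intervals) < p_c:
            low = mid
        else:
            high = mid
    return (low + high) / 2.0


if __name__ == "__main__":
    # Hypothetical proportions, for illustration only.
    print(d_prime_yes_no(p_detection=0.50, p_false_alarm=0.16))
    print(d_prime_forced_choice(0.65))
```

If the theory is internally consistent, the two estimates obtained for the same signal intensity should agree within sampling error.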


The First Experiment 


Procedure 


An experiment was conducted to test the internal consistency of the proposed theory, using three 
University of Michigan sophomores as observers. A series of eight experimental sessions involving the 
forced-choice procedure was followed by a series of sixteen sessions in which the yes-no method of 
response was used. All of the experimental sessions employed a circular signal, thirty minutes of 
visual angle in diameter, with a duration of .01 second, on a ten foot-lambert background. Five in- 
tensity values of signal were used in the forced-choice sessions. The four greatest of these, reduced 
by a .1 fixed filter, were used in the yes-no sessions. Details of the experimental procedure and the
laboratory have been published by Blackwell, Pritchard, and Ohmart.¹

In the first four yes-no sessions, two values of a priori probability, P(SN) equal to .8 and .4,
were used. The observers were informed of the value of P(SN) before each session. No values or costs
were incorporated in these four sessions; they were excluded from the analysis as practice sessions. In
the next twelve yes-no sessions, all of the information necessary for the calculation of w (the best
possible decision level) was furnished to the observers (i.e., P(SN) and the various values and costs).
Although they did not know the formal calculation of w, they evidently knew the direction in which a
change in any factor of the w equation should move the cut-off point, since the obtained values of
P_N(A) varied appropriately with changes in the information given them. The values and costs were made
real to the observers, for they were actually paid in cash. Each session the observers realized a bonus
of between one and two dollars.

The first four of these sessions each carried the same value of w, since the same payoff was
maintained and P(SN) was held at .8. A high value of P_N(A), or false-alarm rate, resulted. In the next
four sessions, with P(SN) held at .8, K_{N·A} and V_{N·CA} were gradually increased from session to session
(not within sessions) until P_N(A) dropped to a low value. Then P(SN) was dropped to .4, and K_{N·A} and
V_{N·CA} were reduced so that for the next session P_N(A) stayed low. The last three sessions involved
successive increases in V_{SN·A} and K_{SN·CA}, again forcing P_N(A) toward a higher value.


Results 


The Internal Consistency Check. The yes-no data obtained from each observer for each value of
signal intensity were plotted in the form of scatter diagrams of P_SN(A) vs. P_N(A). Comparison of these
scatter diagrams with the theoretical curves of Fig. 4 provides an estimate of d' from yes-no data.
Each d' estimated in this way is based on 560 observations. Estimates of d' from forced-choice data
are made by entering the forced-choice curve (Fig. 7) using the observed proportion of correct responses
as an estimate of P(C). The last two forced-choice sessions were used in this analysis; each value of
d' estimated from forced-choice data is based on 100 observations.

Figs. 8, 9, and 10 show log d' as a function of log signal intensity for the three observers. In 
general, the agreement is good. The deviation of the forced-choice points at the top and bottom of the 
graphs can be explained on the basis of sampling variation. For the third observer, the lowest forced- 
choice point is off the graph to the right of the line. 


The Relationship Between Neural Activity and Signal Intensity. Figs. 8, 9, and 10 point up
another difference between conventional sensory theory and the theory presented here. In conventional
theory, the assumption is made that the relationship between neural activity and signal intensity is
linear. The results obtained from this experiment suggest that neural activity is a power function of
signal intensity, a result that is consistent with a more direct type of neuro-physiological data, in
particular, the results of electrical recordings from optic nerve fibers.
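
A power function appears as a straight line on the log-log plots of Figs. 8 to 10, so its exponent can be read off as a slope. The sketch below fits d' = c · I^b by least squares on the logarithms. The data in the usage lines are synthetic, generated from an exact power law purely to exercise the code; they are not the experimental values, and the function name is hypothetical.

```python
import math


def fit_power_law(intensities, d_primes):
    """Least-squares fit of log10(d') = log10(c) + b * log10(I); returns (c, b)."""
    xs = [math.log10(i) for i in intensities]
    ys = [math.log10(d) for d in d_primes]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return 10.0 ** intercept, slope


if __name__ == "__main__":
    # Synthetic illustration only: an exact power law with exponent 1.5.
    intensities = [1.0, 2.0, 4.0, 8.0]
    d_primes = [2.0 * i ** 1.5 for i in intensities]
    c, b = fit_power_law(intensities, d_primes)
    print(f"c = {c:.3f}, exponent b = {b:.3f}")
```

An exponent near 1 would correspond to the linear relationship assumed by conventional theory.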


An External Consistency Check. The results reported above support internal consistency. The
theory also turns out to be consistent with the data in the literature, for, when the d' vs. signal 
intensity function for any one of the observers is used to predict probability of detection as a 
function of signal intensity in terms of this theory, the result closely approximates a type of curve, 
a normal ogive, that is frequently reported. Chi-square analyses suggest that approximately fifteen 
times the ordinary amount of data would be required to distinguish the predicted curve from a normal 
ogive. 


Additional Analyses Suggested by the Theory. According to conventional theory, application of the
chance correction should yield corrected values of P_SN(A) which are independent of P_N(A), or should
yield corrected thresholds in the conventional sense which are independent of P_N(A). Rank-order
correlations for the three observers between P_N(A) and corrected thresholds (.30, .71, .67) are highly
significant; the combined p = .0002. Similar correlations were obtained (.32, .62, .76) between P_N(A)
and corrected P_SN(A). These results are consistent with the theory presented here.

Another method of comparison is to fit the scatter diagrams of P_SN(A) vs. P_N(A) by straight lines.
According to conventional theory, these straight lines should intercept the point (1.00, 1.00).
Sampling error would be expected to send some of the lines to either side of this point. The four scatter
diagrams obtained from each of the three observers are reproduced in Figs. 11, 12, and 13. All twelve
of these lines intersect the line P_SN(A) = 1.00 at values of P_N(A) between 0 and 1.00, approximately in
an order which would be predicted if these lines were arcs of the curves P_SN(A) vs. P_N(A) as defined by
the theory proposed here.


The Second Experiment 


A second experiment was conducted to test the theory proposed here and to provide additional basis 
for selecting between this theory and conventional theory. This experiment was suggested by R. Z.
Norman; its results were reported by Swets.³


The Rationale for This Experiment 


As pointed out above, conventional theory is consistent with the view that the mechanism of
detection is one that triggers when the amount of neural activity exceeds a criterion amount, and loses
all discrimination among quantities of neural activity that fall short of this amount. Thus, for a
(four-choice) forced-choice situation where the observer is required to indicate a second choice as
well as a first choice, conventional theory leads to the prediction that, when the first choice is
incorrect, the second choice will be correct with a probability of .33, since the second choice is made
from among three intervals presumably on a chance basis. On the other hand, the theory proposed here
supposes that the observer works with a variable x (likelihood ratio) that is continuous throughout the
range of x, not merely continuous above a critical point. If this is the case, the observer should be
able to rank the four values of x associated with the four intervals; then the probability of a correct
second choice, given an incorrect first choice, is greater than .33. The relationship between this
predicted probability and d' is given by the expression


    \frac{3 \int_{-\infty}^{+\infty} [F(x)]^2 [1 - F(x)] g(x) \, dx}{1 - \int_{-\infty}^{+\infty} [F(x)]^3 g(x) \, dx}    (5)

where the symbols have the same meaning as in Eq. (4).

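Expression (5) can be evaluated in the same way as Eq. (4). The sketch below assumes N ~ Normal(0, 1) and S+N ~ Normal(d', 1) and uses the trapezoidal rule; the numerator is the probability that exactly one of the three noise samples exceeds the signal sample, and the denominator is the probability of an incorrect first choice. The function name and integration settings are illustrative assumptions.

```python
from statistics import NormalDist

STD = NormalDist()


def p_second_choice_correct(d_prime, lo=-8.0, hi=12.0, steps=4000):
    """P(correct second choice | incorrect first choice), four intervals."""
    h = (hi - lo) / steps
    p_signal_ranked_second = 0.0
    p_first_correct = 0.0
    for i in range(steps + 1):
        x = lo + i * h
        weight = 0.5 if i in (0, steps) else 1.0
        area_n = STD.cdf(x)                 # F(x): area of N below x
        ordinate_sn = STD.pdf(x - d_prime)  # g(x): ordinate of S+N at x
        p_signal_ranked_second += weight * 3.0 * area_n ** 2 * (1.0 - area_n) * ordinate_sn
        p_first_correct += weight * area_n ** 3 * ordinate_sn
    p_signal_ranked_second *= h
    p_first_correct *= h
    return p_signal_ranked_second / (1.0 - p_first_correct)


if __name__ == "__main__":
    for d in (0.0, 0.5, 1.0, 1.5, 2.0):
        print(f"d' = {d:.1f}: P = {p_second_choice_correct(d):.3f}")
```

At d' = 0 the expression reduces to the chance value of one in three predicted by conventional theory.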
Results 


Data were collected from four observers, each of whom served in three sessions. Each session
included 150 observations for which both a first and second choice were required. The resulting twelve
proportions of correct second choices are plotted against d' in Figure 14. Although a single value of
signal intensity was used, the values of d' differed sufficiently from one observer to another to
provide an indication of the congruence of the data and the predicted functions. (The function predicted
for the three-choice (or three-interval) situation is included in Fig. 14 to emphasize that this
function is not the same as the predicted function of the probability of a correct second choice, given an
incorrect first choice, for the four-choice situation.)

A systematic deviation from the prediction of conventional theory clearly exists. Considering the 
combined data, the proportion of correct second choices is .46. The deviation of this proportion from 
.33 is highly significant; the χ² obtained (3.66) is more than twice the χ² (19.0) associated with a
probability of .00001. Allowing for the possibility that being required to make a second choice might 
depress first-choice performance, blocks of 50 observations for which only a first choice was required 
were alternated with blocks of 50 observations for which both a first and second choice were required. 
Pooling the data, the proportions of correct first choices for the two conditions are .650 and .651; 
this difference is obviously not significant. 

The systematic deviation of the second-choice data from the function predicted by the theory pro- 
posed here may be a result of the inadequacy of the assumption of equal variance for N and all values 
of S+N. Any assumption involving a constant ratio of mean to standard deviation would result in lower 
predicted values for proportions of correct second choices. Determining the proportionality of mean and 
standard deviation leading to the most adequate predictions, however, does not fall within the scope of 
this paper. It is clear, nonetheless, that the second-choice data tend to confirm the theory proposed 
in this paper and to differentiate between this theory and conventional theory. 


Conclusions 


The following conclusions are advanced: 

1) The conventional concept of a threshold, or a threshold region, needs re-evaluating in the 
light of these data. 

2) The assumptions underlying the use of the correction for chance successes are rejected on
the basis of statistical tests.

3) Change in neural activity is a power function of change in light intensity. 

4) The model provided by the theory of statistical decision, and, in more detail, by the 
mathematical theory of signal detectability, is applicable to the problems of visual detection. 

5) The criterion of seeing depends on psychological as well as physiological factors. In 
these experiments the observers tended to use optimum criteria. 

6) The experimental data support the logical connection between forced-choice and yes-no tech- 
niques developed by the theory presented here. 

7) A measurable false-alarm rate can be, and should be, produced in yes-no psychophysical
experiments.

8) The forced-choice procedure, which does not necessitate the determination of a criterion, 
should be used whenever possible. 


List of References 


1. Blackwell, H. R., Pritchard, B. S., and Ohmart, J. G. Automatic apparatus for stimulus
presentation and recording in visual threshold experiments. J. Opt. Soc. Amer., 44, 1954.

2. Peterson, W. W., Birdsall, T. G., and Fox, W. C. The theory of signal detectability.
Transactions of the I.R.E. Professional Group on Information Theory (this issue).

3. Swets, J. A. An experimental comparison of two theories of visual detection. Unpublished doctoral
dissertation, University of Michigan, 1954.

4. Tanner, W. P., Jr., and Swets, J. A. A new theory of visual detection. Technical Report No. 18,
Electronic Defense Group, University of Michigan, 1953.


Fig. 1 - Hypothetical data from detection experiments. (Proportion of positive responses vs. stimulus intensity.)

Fig. 2 - Hypothetical distributions of noise and signal plus noise. (Ordinate: probability density.)

Fig. 3 - P_SN(A) vs. P_N(A) for d' = 1. (Inset: the N and S+N distributions on the criterion scale.)

Fig. 4 - P_SN(A) vs. P_N(A) with d' as the parameter.

Fig. 5 - P_SN(A) vs. P_N(A) as a function of d' assuming conventional theory.

Fig. 6 - P(A) as a function of d' assuming the statistical-decision model.

Fig. 7 - P(C) as a function of d', a theoretical curve.

Fig. 8 - Log d' vs. log signal intensity for observer 1.

Fig. 9 - Log d' vs. log signal intensity for observer 2.

Fig. 10 - Log d' vs. log signal intensity for observer 3.

Fig. 11 - P_SN(A) vs. P_N(A) for observer 1.

Fig. 12 - P_SN(A) vs. P_N(A) for observer 2.

Fig. 13 - P_SN(A) vs. P_N(A) for observer 3.

Fig. 14 - Second-choice data.


THE HUMAN USE OF INFORMATION 


II. SIGNAL DETECTION FOR THE CASE OF AN UNKNOWN SIGNAL PARAMETER*


Wilson P. Tanner, Jr. and Robert Z. Norman 
University of Michigan 


Abstract 


Two specific cases of signal detection involving uncertainty in the frequency of a sound signal
are compared with the case of the signal-known-exactly. In the first case the signal is either of two
known frequencies; in the second case the signal is any frequency within a given range. It is
suggested that detection behavior that is optimal for the three cases requires a dual mechanism: a combination
of a wide-open receiver and a panoramic receiver. Evidence is presented that supports the existence of
such a mechanism. Estimates of the bandwidth and scan-rate of the receiver are included.


Introduction 


This paper is one of a series in which receiver theory is applied to human sensory behavior. This
is a logical application, for the human sensory systems are, of course, receivers, picking up
transmitted energy and transforming this energy to a useful form.

The observable aspects of the system are (1) the input to the system, and (2) behavioral acts 
based on an interpretation of the output. Inferential knowledge of the receiver characteristics can be
gained by a study of the observable data, with some help from physiological studies of the sensory
systems of infra-human animals. 

Generally, empirically derived relations between the two sets of observable data have failed to 
furnish an adequate basis for understanding the sensory systems. A more fruitful approach lies in the 
construction of a hypothetical model based on simple assumptions consistent with known physiological 
data. This model must lead to predictions consistent with physiological data, and it must also be 
capable of generating new hypotheses. It is for these reasons that a model based on the theory of 
statistical decision (or the theory of testing statistical hypotheses) was selected as appropriate. 
This model suggests considering the sensory systems as receivers subject to internal noise (an assump- 
tion consistent with physiological data); in addition, this model requires the assumption that the 
output of the sensory systems is treated in an optimum manner (in effect, a new hypothesis). 

The first step in the development of the statistical decision model for human sensory behavior was
taken by Tanner and Swets. Their results show that for the case of the signal-known-exactly in visual
detection the optimization assumption is reasonable. There appears to be a mechanism capable of
behaving as a hypothesis-testing mechanism which acts on the basis of likelihood ratios at the output
of the visual pathways. Enough of this experiment has been repeated for the auditory case of detecting
signals in noise so that, when considered in conjunction with the evidence of Smith and Wilson, the
model can be considered applicable to audition as well as vision.

Tanner and Swets were concerned with the case of the signal-known-exactly. This paper is
concerned chiefly with detection for a case where the signal is not known exactly. The particular case is
that in which there is uncertainty in the frequency of a sound signal which appears in a noise
background. Two specific cases will be compared with the case of the signal-known-exactly: (1) the signal
is at one of two frequencies with the separation of the frequencies as a parameter, and (2) the signal 
is at any frequency within a given range. 

These three simple detection problems point out that care must be exercised in applying the
optimization assumption. Each situation, considered alone, requires a somewhat different receiver or
combination of receivers for optimum behavior. It thus becomes apparent that one of the criteria for
selecting the hypothesized type of optimization is the compatibility of the number of different
receivers required by the particular type of optimization with present knowledge of neurophysiology.
It is unlikely that a separate receiver exists for every possible laboratory situation. It seems 
necessary, therefore, to try to find a single receiver which is optimum for the three laboratory situa- 
tions outlined above, or, if more than one receiver is to be called into play, to insist that such 
additional mechanisms must be capable of being justified on the basis of more general considerations, 
for example, biological utility. For the three situations of concern in this paper, for instance, a
multiplex receiver that is capable of handling any combination of the three experimental situations
would be the optimum receiver. The existence of such a receiver, however, is not compatible with the
data presented below. A panoramic receiver, with a scan rate determined by the task, is a receiver
near optimum for these three tasks, although this falls considerably below the multiplex receiver for
the case where the signal is one of two frequencies. Actually, the latter is not a biologically
significant case, and consequently should be given a minor role in the application of the optimization
assumption.

* This paper is based on work done for the U. S. Army Signal Corps under Contract No. DA-36-039
sc-15358.


A Dual Mechanism 


There is a biologically significant case which suggests the possibility of more than one
mechanism for the three cases considered above. This is the case where it is optimal for the animal to
attend to specific signals, and, at the same time, to be warned in the event that something occurs
outside of the range of signals to which he is attending. The attention to a specific signal requires
a narrow-band receiver; the warning requires a wide-open receiver. A much over-simplified neural
system permitting the operation of such a dual system is illustrated in Figure 1. The columns labeled R
are the receptors A, B, C, D, E. The columns labeled N are neurons A, B, C, D, E, and W. If a signal
is detected at the output of a neuron (A to E) the receptor from which the signal originated is known.
If one is detected at the output of W, however, the only information is that a signal exists,
originating from at least one of the receptors. The information from W is that provided by a wide-open
receiver; from A to E the information is that provided by narrow-band receivers.

Now, if the animal is attending to E, for example, occasional inspection of W may serve to
determine the existence of signals other than those originating at E. If one exists, then A to D can be
considered individually. This arrangement may be far more efficient than examining A to D
periodically for the warning. It may thus be reasonable to look for a dual mechanism, a combination of a
wide-open and a panoramic receiver. If this is the case, attention should be divided between W and
the center of attention, depending on the a priori probabilities of signals over A to E. If there is
no probability of signals other than those to which attention is directed, then W should be ignored.
If two signals are sufficiently close together, such that the panoramic receiver (with controlled scan
range) offers the better probability of detection, then again W should not be observed. If the two
signals are farther apart, then either a wide-open receiver or some combination of a wide-open and a
fixed-tuned receiver (panoramic, with zero range) should be called into play.
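
The inspection scheme just described can be made concrete with a toy, noise-free sketch. It is only an illustration of the logic, with invented names, and makes the strong simplifying assumption that every neuron reports its receptors without error: neuron W responds if any receptor is active, and the narrow-band neurons A to D are examined only after W has responded.

```python
RECEPTORS = ("A", "B", "C", "D", "E")


def neuron_outputs(active_receptors):
    """Return (narrow_band, wide_open) outputs for a set of active receptors.

    narrow_band maps each receptor name to True/False; wide_open is the
    output of W, which only says that at least one receptor is active.
    """
    narrow_band = {name: name in active_receptors for name in RECEPTORS}
    wide_open = any(narrow_band.values())
    return narrow_band, wide_open


def monitor(active_receptors, attended="E"):
    """Look at W first; scan the unattended channels only if W responds.

    Returns (unattended receptors found active, number of inspections).
    """
    narrow_band, wide_open = neuron_outputs(active_receptors)
    inspections = 1  # the single look at W
    if not wide_open:
        return [], inspections
    found = []
    for name in RECEPTORS:
        if name == attended:
            continue
        inspections += 1
        if narrow_band[name]:
            found.append(name)
    return found, inspections


if __name__ == "__main__":
    print(monitor(set()))   # nothing active: one inspection suffices
    print(monitor({"B"}))   # W responds, so A to D are examined
```

When signals outside the attended channel are rare, most checks cost a single inspection of W rather than a scan of every unattended channel.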




The Bandwidth Problem


From the above discussion it is apparent that one of the variables relevant to the problem is the
bandwidth of the receiver in operation. Several writers have reported data bearing on the bandwidth
question. In general, there is good agreement on this subject. The results of these studies are
reproduced in Figure 2, taken from Licklider. These studies were all performed under different
experimental conditions; all, however, for the case of the signal-known-exactly. The material
presented below is in agreement with the data represented in Figure 2 to the extent that bandwidth is
regarded as a similar function of frequency. This paper, however, differs with respect to estimates
of the width of the band. The estimate of bandwidth represented in Figure 2 depends upon an arbitrary
assumption that a signal is just audible when its acoustic power is the same as that of the masking
noise; this assumption is not subscribed to here.

Green has recently completed two studies in the Electronic Defense Group laboratory which bear
on this problem. The first of these studies involved the comparison of an inferred d' with a
calculated ideal based on a 10 cps bandwidth.* The principle involved in the study is illustrated in
Figure 3. For low values of signal-to-noise ratio (S/N), d' varies as a power of S/N, and is less
than the calculated ideal. As S/N increases, d' rapidly approaches the calculated ideal at a d' of
the order of 5. d' cannot, of course, exceed the ideal. This suggests that the bandwidth may be as narrow
as 10 cps.

Green's second study involves the problem of matching bandwidth to signal duration. For dura- 
tions less than about .08 second at 1000 cps, d' is a linear fumotion of signal duration, t. For 
durations greater than about .08 second, d' varies as /t-* Thus, 12.65 cps appear to be the maximm 
bandwidth. Now, suppose the observer knows a signal is .2 seconds in duration. If he is still oper- 
ating with a 12.5 ops bandwidth, signals of .1 second in duration, introduced without his knowledge, 
should result in a d' which is .707 of the same signal at .2 second. If, however, he has matched his 
bandwidth, narrowing it to 5 cps, then the signal of .1 seoond duration, again introduced without 


* For the definition of d' see Tanner and Suctaie 


223 


knowledge, should yield a d' of 1/2 that of the same signal energy .2 second in duration. In an ex- 
ploratory experiment, the d' for the .1 second signal was observed to be exactly 1/2 that of the .2 
second signal. The likelihood ratio comparing the matched bandwidth against the fixed bandwidth was 
3:1 in favor of the matched bandwidth, suggestive but scarcely conclusive. The difference between the 
fixed-bandwidth prediction and the experimental result is significant at the ten percent confidence 
level. This work is being continued. 
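The two predictions above (.707 for a fixed bandwidth, 1/2 for a matched bandwidth) follow from the duration laws just stated. The short sketch below reproduces the arithmetic. It assumes a constant-power signal, a critical duration equal to the reciprocal of the bandwidth, and a d' that is continuous at that duration; the function names are illustrative and are not taken from Green's report.

```python
import math

def relative_d_prime(duration_s, bandwidth_cps):
    """Relative d' (arbitrary units) for a signal of the given duration.

    Assumed model: d' grows linearly with duration up to the critical
    duration 1/bandwidth, and as the square root of duration beyond it.
    """
    critical_s = 1.0 / bandwidth_cps
    if duration_s <= critical_s:
        return duration_s
    return math.sqrt(critical_s * duration_s)

def short_to_long_ratio(short_s, long_s, bandwidth_cps):
    """Ratio of d' for the short signal to d' for the long signal."""
    return relative_d_prime(short_s, bandwidth_cps) / relative_d_prime(long_s, bandwidth_cps)

# Fixed 12.5 cps bandwidth: both durations exceed .08 second (square-root region).
fixed_ratio = short_to_long_ratio(0.1, 0.2, 12.5)
# Bandwidth matched to the .2 second signal (5 cps): both durations in the linear region.
matched_ratio = short_to_long_ratio(0.1, 0.2, 5.0)

print(f"fixed bandwidth ratio:   {fixed_ratio:.3f}")
print(f"matched bandwidth ratio: {matched_ratio:.3f}")
```

Under these assumptions the fixed-bandwidth ratio is the square root of 1/2 and the matched-bandwidth ratio is 1/2, which are the two values being compared in the exploratory experiment.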

In general, Green's work shows that the maximum possible bandwidth is a logarithmically increasing 
function of frequency. The nature of this function, for the range of frequencies investigated 
(500 cps - 4000 cps), can be described approximately by the equation 

w = 10^{kf}     (1) 

where w is an inferred measure of bandwidth, f is frequency, and k is an individual constant. 


The Experiments 


The experiments to test the mechanism described above were designed to answer the following ques- 
tions. 1) Can the hearing mechanism act as a fixed-tuned receiver? 2) When two signals are separated 
in frequency, is it possible to listen for both at the same time? 3) Is the scanning hypothesis 
feasible? 4) Are there situations in which the data can best be described in terms of a wide-open 
receiver? The procedures for testing were the same throughout. 


Procedure 

A forced-choice experimental technique was used. All programming was carried out by N. P. Psytar® 
(Noise Programmed PSYchophysical Testing And Recording). The observers listened with Permoflux PDR-8 
cushioned headphones to a signal presented by a tone burst generator in a background of white noise. 
The signal occurred simultaneously with one of four flashes of a neon bulb, and the observer's task was 
to state with which of the flashes the signal occurred. Wherever the experiment involved a comparison, 
such as that between the signal-known-exactly and the signal known to be one of two frequencies, the 
comparison was based on a single day's data if at all possible. 


Experimental Evidence 

The Ability to Act as a Fixed-Tuned Receiver. The first, and simplest, experiment merely in- 
volves the ability of the observer to tune to a specific frequency to the exclusion of others. The 
training period, during which the observers became acquainted with the apparatus and became used to 
listening for signals in noise, was conducted employing only a 1000 cps tone burst .143 second in 
duration. When they had progressed sufficiently so that no further learning effects were anticipated, 
the frequency of the tone was switched to 1300 cps, at the same energy level which in the noise 
background yielded a P(C) (probability of correct choice) of approximately .65 at 1000 cps. The 
observers were not informed of the change. P(C) for the four observers was approximately chance. They 
insisted that the experimenter had forgotten to turn on the signal generator. Later tests showed that 
when they knew the frequency, the P(C) at that signal level, noise level, and frequency (1300 cps) was 
again approximately .65. It is apparent from this experiment that the hearing mechanism can act as a 
narrow-band receiver. Unfortunately the nature of the experiment is such that a systematic set of 
similar experiments (varying the frequency difference between the expected and unexpected signals) is 
impossible with a single set of observers. 
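For readers who wish to relate the P(C) values quoted here to d', the following sketch evaluates the standard forced-choice relation for m equally likely intervals, P(C) = ∫ φ(x − d') Φ(x)^(m−1) dx, where φ and Φ are the unit normal density and distribution functions. It assumes equal-variance normal noise and independent intervals; the function names, integration limits, and step count are choices made for this illustration only.

```python
import math

def normal_pdf(x):
    """Unit normal density."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def normal_cdf(x):
    """Unit normal distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def percent_correct(d_prime, alternatives=4, steps=2000):
    """P(C) in an m-interval forced choice, by the trapezoid rule.

    The integrand is the density of the signal-interval observation times
    the probability that all (m - 1) noise-interval observations fall below it.
    """
    lower = d_prime - 8.0
    width = 16.0 / steps
    total = 0.0
    for i in range(steps + 1):
        x = lower + i * width
        weight = 0.5 if i in (0, steps) else 1.0
        total += weight * normal_pdf(x - d_prime) * normal_cdf(x) ** (alternatives - 1)
    return total * width

def d_prime_for(pc_target, alternatives=4):
    """Invert percent_correct by bisection on the interval 0 <= d' <= 10."""
    low, high = 0.0, 10.0
    for _ in range(50):
        mid = 0.5 * (low + high)
        if percent_correct(mid, alternatives) < pc_target:
            low = mid
        else:
            high = mid
    return 0.5 * (low + high)

chance_level = percent_correct(0.0)   # four intervals, no signal information
d_at_65 = d_prime_for(0.65)           # d' corresponding to P(C) = .65
print(f"P(C) at d' = 0: {chance_level:.3f}")
print(f"d' for P(C) = .65: {d_at_65:.2f}")
```

With d' = 0 the expression reduces to 1/m, which is the chance level of .25 for the four-flash procedure described above.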

Simultaneous vs. Successive Observation. For the case where the signal is known to be one of two 
frequencies, different hypotheses lead to different predictions of detection rates. Figure 4 shows 
the predictions based on simultaneous observation and on successive observation compared to the case 
of the signal-known-exactly. The curve for simultaneous observation assumes the signals are 
sufficiently separated in frequency to be clearly resolved by the receiver, while the curve for 
successive observation assumes a rectilinear passband, such that signals outside of the band are 
infinitely attenuated. Both curves are probably a little lower than they should be. 

In a series of experiments in which the two frequencies were below 2000 cps and were separated by 
from 200 cps to 800 cps, the results suggest that for all of the separations and for durations of 
about .05 second, simultaneous observation is impossible. Only in a few individual experiments were 
the results consistent with simultaneous observation, and these few could have occurred on a chance 
basis if the successive-observation hypothesis holds. 
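One simple reading of the successive-observation hypothesis can be put in numbers. The sketch below assumes that on each trial the observer is tuned to exactly one of the possible frequencies, each equally likely, and that a signal falling outside the rectilinear passband leaves him at the chance level for the four-interval task. This mixture rule is an assumption made for illustration and is not necessarily the computation behind the curve in the figure.

```python
def successive_observation_pc(pc_known, n_frequencies=2, n_intervals=4):
    """Predicted P(C) when only one of n_frequencies can be attended per trial.

    Assumptions (illustrative only): the attended frequency is chosen at
    random with equal probability, and a signal at an unattended frequency
    is infinitely attenuated, so the response is a pure guess.
    """
    chance = 1.0 / n_intervals
    p_tuned = 1.0 / n_frequencies
    return p_tuned * pc_known + (1.0 - p_tuned) * chance

# Example: P(C) = .8 for the signal-known-exactly, signal one of two frequencies.
pc_one_of_two = successive_observation_pc(0.8)
print(f"successive observation, one of two: {pc_one_of_two:.3f}")
```

As the number of possible frequencies grows, this prediction approaches the chance level of .25.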

Evidence for the Scanning Hypothesis. The comparison, however, is a function of the signal 
duration. If two experiments are run comparing detection for the case of a signal with known 
frequency versus one of two known frequencies, and all of the parameters of the two experiments are 
the same except signal duration, then performance with the shorter duration might be expected to 
suffer more from the lack of knowledge in the two-frequency cases. Such experiments were conducted 
using frequencies of 400 cps and 1000 cps. The signal durations were .05 and .2 seconds. In each 
case, for a single frequency, the signal-noise ratio was adjusted for a P(C) of .8. When the 
frequency was one of two known values, detection for the .2 second duration was significantly 
greater than for the .05 second duration. This supports the scanning hypothesis. 

Evidence for the Wide-Open Receivers. If the observer knows only that the signal will be in a 
given frequency range, and this frequency is varied randomly from trial to trial, the probability that 
the observer is looking for the signal frequency at the time of its occurrence is very low. If the 
auditory sensory system acts like the narrow-band receiver described in the experiments reported above, 
P(C) should drop to about .25, the chance probability. Our observations indicate that this is not 
quite true, although the detection rate does drop well below that for the case of one of two known fre- 
quencies. In this case, the hearing mechanism appears to act as a wide-open receiver, not nearly as 
sensitive as the narrow band receiver because it is open to noise as well as to the signal. It is this 
case, along with the biological utility of such a mechanism, that leads to the inclusion of a wide-open 
receiver, necessitating the postulation of a dual mechanism. 

An Estimate of Attainable Scan Speeds. The experiments reported above support the hypothesis that 
the hearing mechanism is indeed a dual mechanism. One part of the mechanism operates as a wide-open 
receiver, while the other operates as a panoramic receiver. These experiments, however, tell little 
about the parameters of the panoramic receiver. Apparently it can scan either at 0 speed (fixed tuned 
as in the case of the initial experiment reported) or at some speed greater than 0. If one is willing 
to make certain assumptions, it is possible to say a little more about the scan-rate parameter. For 
example, if a linear scan covering the range determined by the two frequencies is assumed, it is 
possible to estimate scan rate. While this assumption is probably not realistic, it is made here for 
the purpose of presenting some preliminary calculations of scan-rate. 

Two experiments were involved in this study. In each experiment the signal could occur anywhere 
within a range bounded by two frequencies: 400-1100 cps and 1000-1700 cps, respectively. P(C) was 
first determined in each range for signals of known frequency and .1 second duration. Then the 
unknown-frequency experiment was done, increasing duration until the known P(C) was again achieved. 
In the lower frequency range it was necessary to increase the signal duration to approximately .3 
second to achieve this detection level, while in the higher range it was necessary to increase the 
duration to approximately .2 second. Thus, the high frequency range, which is 700 cps wide, can 
apparently be scanned at a rate approximately 1.5 times the scan rate in the lower frequency range, 
which is also 700 cps wide. 

Thus, in the low range it is possible to scan over the frequency range (700 cps) in something like 
.2 second, and at the higher range in approximately .1 second. Assuming linearity over the range only, 
the scan-rate is 3500 cycles per second per second in the lower frequency range, while in the higher 
range it is 7000 cycles per second per second. Arbitrarily assuming that these are the rates for 700 
cps and 1400 cps (approximately the mid-frequencies of the two ranges), the scan-rate may be 
approximated by the equation 

df/dt = 5f.     (2) 

The scan-rate thus appears to be a linear function of frequency. 
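The scan-rate figures can be checked with a few lines of arithmetic. The sketch below makes one inference that the text does not state outright: the scan times of .2 and .1 second are taken to be the required durations (.3 and .2 second) less the .1 second used for the known-frequency signal. It then reads Equation (2) as df/dt = 5f and integrates it, giving a sweep time of ln(f2/f1)/5 from f1 to f2, as a consistency check on the linear estimates. All names are illustrative.

```python
import math

RANGE_WIDTH_CPS = 700.0    # width of each frequency range
KNOWN_DURATION_S = 0.1     # duration used when the frequency is known

def linear_scan_rate(required_duration_s):
    """Scan rate in cps per second, assuming a linear scan over the range.

    Assumption: the scan time is the extra duration needed beyond the
    known-frequency duration.
    """
    scan_time_s = required_duration_s - KNOWN_DURATION_S
    return RANGE_WIDTH_CPS / scan_time_s

def exponential_sweep_time(f_start_cps, f_stop_cps, k=5.0):
    """Time to sweep from f_start to f_stop when df/dt = k * f."""
    return math.log(f_stop_cps / f_start_cps) / k

low_rate = linear_scan_rate(0.3)    # about 3500 cps per second
high_rate = linear_scan_rate(0.2)   # about 7000 cps per second

# Ratio of rate to nominal mid-frequency; both come out near 5, as in Equation (2).
low_k = low_rate / 700.0
high_k = high_rate / 1400.0

# Sweep times implied by integrating df/dt = 5f over each range.
low_sweep_s = exponential_sweep_time(400.0, 1100.0)
high_sweep_s = exponential_sweep_time(1000.0, 1700.0)

print(f"linear rates: {low_rate:.0f} and {high_rate:.0f} cps per second")
print(f"rate / frequency: {low_k:.2f} and {high_k:.2f}")
print(f"sweep times from df/dt = 5f: {low_sweep_s:.3f} s and {high_sweep_s:.3f} s")
```

The integrated sweep times come out close to .2 and .1 second, so the exponential scan implied by Equation (2) agrees reasonably well with the linear estimates from which it was derived.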


Conclusions 


The hearing mechanism is treated as a dual mechanism and experimental evidence is presented 
supporting the feasibility of this treatment. The two components of the mechanism are 1) a narrow-band 
panoramic type receiver, and 2) a wide-open receiver. The employment of these receivers is under con- 
trol of the individual and dependent upon the type of task he is asked to perform. When frequency 
information is either available, or required, the narrow-band receiver is used. When one is trying to 
detect only the presence of a signal, the wide-open receiver is employed. 

The experiments performed involved signals at a level seldom significant in real life situations. 
This was necessary for the purpose of the study. It also leads to some statements that, at first 
glance, suggest the existence of behavior that is biologically detrimental. For example, the ability 
to attend to a single frequency to the exclusion of others differing by a relatively few cycles could 
lead to disastrous events. The fact that the level of signal employed is so low leads to this result. 
Signals at higher levels either may not be attenuated to so great a degree, or may be sufficient in 
amplitude for detection with occasional reference to the wide-open receiver, particularly if they are 
of sufficient duration to be significant to the individual. 

Another problem arises, and this is the perception of speech. For speech to be perceived, the 
panoramic receiver is required. The antics it must perform if it is to follow the sound frequency 
patterns are scarcely imaginable. The obvious conclusion is that the receiver does not follow these 
sound patterns exactly. It searches on the basis of conditional probabilities, frequently failing in 
the search. When it fails, the undetected frequencies are filled in on the basis of a posteriori 
probabilities. It is for this reason that a phoneme improperly used or improperly articulated may not 
be detected, and a more likely phoneme substituted in its place by the listener. When substitutions 
of this sort are made, the listener is usually convinced that he heard the more likely phoneme. He 
seems to be unaware of having made a correction. 

There is still a great deal of work necessary to complete the picture. Some of this work is in 
progress, including parallel studies in the visual area to see if the dual mechanism best describes 
the case where signal location is unknown. The problem is complicated by the possibility of a non- 
linear scan, and further progress depends on determining the nature of the non-linearity and more 
precise information on scan-rates. 


References 


1. Tanner, W. P., Jr., and Swets, J. A., "The Human Use of Information. I. Signal Detection for the 
Case of the Signal-Known-Exactly." Transactions of the I.R.E. Professional Group on Information 
Theory (This issue). 

2. Smith, M., and Wilson, Edna A., "A Model of the Auditory Threshold and Its Application to the 
Problem of the Multiple Observer." Psychol. Monog., Vol. 67, No. 9, 1953. 

3. Licklider, J. C. R., "Basic Correlates of the Auditory Stimulus," in S. S. Stevens (Ed.), Handbook 
of Experimental Psychology, New York: Wiley, 1951. 

4. Green, D. M., "Signal Detection as a Function of Frequency and Duration," in Technical Report No. 
30, Electronic Defense Group, University of Michigan (in preparation). 




Figure 1. A Simplified Model of the Hearing Mechanism. 

Figure 2. Estimate of Bandwidth as a Function of Frequency. (Taken from Licklider³.) Ordinate: Δf in 
cps; abscissa: frequency in cps, 100 to 10,000. Curves: masking, frequency discrimination, pitch 
scale, intelligibility. 

Figure 3. Human Observer Compared to Ideal Observer. 

Figure 4. Case of Simultaneous and Successive Observation Compared to the Case of the Signal Known 
Exactly. Curves: signal one of two, simultaneous observation; signal one of two, successive 
observation. 




